xmlguru.cz

Spreading the XML paradigm around

DocBook, translations and ITS

2006-03-15

Much of the content stored in DocBook is translated into other languages. Translation is a complex problem and it is very expensive. One small issue in every automated translation system is the need to specify which content should be translated and which should not. The W3C is developing a small XML vocabulary called ITS (Internationalization Tag Set) which can be used to identify translatable content. Let's see how we can integrate it into DocBook.


Table of Contents

Integrating ITS into DocBook
My comments about ITS

I looked into ITS because one DocBook user had a requirement for identifying text that should not be translated. The basic idea of ITS is very simple. You can use the translate attribute (in the ITS namespace) to specify whether some content should be translated or not.

<para its:translate="no">Nontranslateable content</para>

There is also a documentRules element, which can be used to define general rules that select elements or attributes (based on an XPath pattern) to translate or not to translate.
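
To make the effect of the translate attribute more concrete, here is a minimal Python sketch of how a translation tool might pick out the text it should hand to a translator. It is only an illustration and not part of ITS or DocBook: the function name translatable_text is hypothetical, documentRules are ignored entirely, and the sketch assumes that an its:translate value is inherited by child elements until it is overridden.

```python
import xml.etree.ElementTree as ET

# Namespace URI as used in the examples in this article.
ITS_NS = "http://www.w3.org/2005/11/its"
TRANSLATE = f"{{{ITS_NS}}}translate"


def translatable_text(elem, inherited=True):
    """Yield the text chunks a translator should see (hypothetical helper).

    Assumption: its:translate applies to the element's own text and is
    inherited by its children until another its:translate overrides it.
    """
    flag = elem.get(TRANSLATE)
    translate = inherited if flag is None else (flag == "yes")
    if translate and elem.text and elem.text.strip():
        yield elem.text.strip()
    for child in elem:
        yield from translatable_text(child, translate)
        # Tail text sits after the child but belongs to this element,
        # so this element's flag decides whether it is translated.
        if translate and child.tail and child.tail.strip():
            yield child.tail.strip()


SAMPLE = """<article xmlns="http://docbook.org/ns/docbook"
         xmlns:its="http://www.w3.org/2005/11/its" version="5.0">
  <para its:translate="no">Nontranslateable content</para>
  <para>Another content</para>
</article>"""

chunks = list(translatable_text(ET.fromstring(SAMPLE)))
print(chunks)
```

Run on the two-paragraph sample, the sketch reports only the text of the second paragraph, because the first one is marked with its:translate="no".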

Integrating ITS into DocBook

As DocBook V5.0 is really very easy to customize, adding ITS support to DocBook was a matter of several minutes. Fortunately, the W3C provides a RELAX NG schema for ITS (its.rnc), so we can include it in our schema together with the base DocBook schema. Then we just need to add the ITS data category attributes to every DocBook element. This is very easy, as there is a named pattern db.common.attributes which defines the attributes available on every DocBook element. I also allowed the documentRules element to appear inside the info element. The new DocBook+ITS schema is pretty short.

# Sample schema that combines DocBook + ITS

# include standard DocBook schema
include "docbook.rnc"

# include ITS schema
include "its.rnc"
{
  start &= empty
}

# add ITS data categories to common attributes
db.common.attributes &= att.datacats.attributes

# allow documentRules inside info element
db.info.elements |= documentRules?

You can use this schema to validate DocBook documents containing ITS markup, or for guided editing in editors like oXygen or Emacs+nXML. The following listing demonstrates one such document.

<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="docbook-its.rnc" type="compact"?>
<article xmlns="http://docbook.org/ns/docbook" 
         xmlns:its="http://www.w3.org/2005/11/its" 
         version="5.0">
  <info>
    <title>Sample article</title>
    <its:documentRules>
      <its:ns its:prefix="db" its:uri="http://docbook.org/ns/docbook"/>
      <its:documentRule its:translate="no"
        its:translateSelector="//db:para/@*"/>
      <its:documentRule its:translate="yes"
        its:translateSelector="//db:para"/>
    </its:documentRules>
  </info>
  <para its:translate="no">Nontranslateable content</para>
  <sect1 status="draft">
    <title>Draft section</title>
    <para>Another content</para>
  </sect1>
</article>

My comments about ITS

Admittedly, I have not read the complete text of the ITS specification, but there are some things that could be improved. First, the provided RELAX NG schema uses poor naming conventions. All named patterns should use a common prefix such as "its." so that they can be easily and safely included in other schemas.

In addition, I don't think that the attributes used inside the documentRules element and its descendants should be prefixed. It goes against the common practice that global (namespace-qualified) attributes are used only on foreign elements. So in my opinion the previous example should be rewritten as:

<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="docbook-its.rnc" type="compact"?>
<article xmlns="http://docbook.org/ns/docbook" 
         xmlns:its="http://www.w3.org/2005/11/its" 
         version="5.0">
  <info>
    <title>Sample article</title>
    <its:documentRules>
      <its:ns prefix="db" uri="http://docbook.org/ns/docbook"/>
      <its:documentRule translate="no" translateSelector="//db:para/@*"/>
      <its:documentRule translate="yes" translateSelector="//db:para"/>
    </its:documentRules>
  </info>
  <para its:translate="no">Nontranslateable content</para>
  <sect1 status="draft">
    <title>Draft section</title>
    <para>Another content</para>
  </sect1>
</article>

I hope that the ITS group will be more responsive to my comments than the XHTML WG. I sent several comments about XHTML 2.0 in the past and have had no response after more than half a year. This is not how standards should be created.

[UPDATE] It seems that the ITS guys are doing their work well. Their response was fast and well grounded.

Copyright © Jiří Kosek, 2006–2018