Spreading the XML paradigm around
2006-03-15
Much of the content stored in DocBook is translated into other languages. Translation is a complex problem and it is very expensive. One small issue in every automated translation system is the need to specify which content should be translated and which should not. The W3C is developing a small XML vocabulary called ITS which can be used to identify translatable content. Let's see how we can integrate it into DocBook.
I looked into ITS because one DocBook user had a
requirement to identify text that should not be
translated. The basic idea of ITS is very simple. You can use
the translate attribute (in the ITS namespace) to
specify whether some content should be translated or not.
<para its:translate="no">Nontranslateable content</para>
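To make the idea concrete, here is a minimal sketch of how a translation tool might honour the attribute. It is only an illustration, not part of ITS or of any real tool: the function name translatable_text is made up, the sketch assumes that content is translatable by default and that the setting is inherited by child elements, and it ignores global rules entirely. The namespace URI is the one used in the examples in this post.

```python
import xml.etree.ElementTree as ET

# ITS namespace URI as used in the examples in this post
ITS_NS = "http://www.w3.org/2005/11/its"
TRANSLATE = f"{{{ITS_NS}}}translate"


def translatable_text(elem, inherited=True):
    """Yield text chunks whose effective its:translate value is "yes".

    Assumptions of this sketch: the default is "yes", and a local
    its:translate attribute applies to the element and its descendants
    until another its:translate overrides it.
    """
    flag = elem.get(TRANSLATE)
    translate = inherited if flag is None else (flag == "yes")

    text = (elem.text or "").strip()
    if translate and text:
        yield text

    for child in elem:
        yield from translatable_text(child, translate)
        # Tail text follows the child but belongs to the current element
        tail = (child.tail or "").strip()
        if translate and tail:
            yield tail


SAMPLE = """<article xmlns="http://docbook.org/ns/docbook"
         xmlns:its="http://www.w3.org/2005/11/its">
  <para its:translate="no">Nontranslateable content</para>
  <para>Another content</para>
</article>"""

root = ET.fromstring(SAMPLE)
result = list(translatable_text(root))
print(result)
```

Only the second paragraph is reported, because the first one is explicitly marked with its:translate="no".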
There is also a documentRules element, which can be used
to define general rules for selecting elements or attributes (based
on an XPath pattern) that should or should not be translated.
As DocBook V5.0 is really very easy to customize, adding ITS
support to DocBook was a matter of several minutes. Fortunately, the W3C provides a RELAX
NG schema for ITS (its.rnc), so we can include it in
our schema together with the base DocBook schema. Then we just need to add
the ITS data category attributes to every DocBook element. This is very
easy, as there is a named pattern, db.common.attributes, which defines the attributes
available on every DocBook element. I also allowed the
documentRules element to appear inside the info
element. The new DocBook+ITS schema is pretty short.
# Sample schema that combines DocBook + ITS

# include standard DocBook schema
include "docbook.rnc"

# include ITS schema
include "its.rnc" { start &= empty }

# add ITS data categories to common attributes
db.common.attributes &= att.datacats.attributes

# allow documentRules inside info element
db.info.elements |= documentRules?
You can use this schema to validate DocBook documents containing ITS markup, or for guided editing in editors like oXygen or Emacs+nXML. The following listing demonstrates one such document.
<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="docbook-its.rnc" type="compact"?>
<article xmlns="http://docbook.org/ns/docbook"
         xmlns:its="http://www.w3.org/2005/11/its"
         version="5.0">
  <info>
    <title>Sample article</title>
    <its:documentRules>
      <its:ns its:prefix="db" its:uri="http://docbook.org/ns/docbook"/>
      <its:documentRule its:translate="no" its:translateSelector="//db:para/@*"/>
      <its:documentRule its:translate="yes" its:translateSelector="//db:para"/>
    </its:documentRules>
  </info>
  <para its:translate="no">Nontranslateable content</para>
  <sect1 status="draft">
    <title>Draft section</title>
    <para>Another content</para>
  </sect1>
</article>
Actually, I did not read the complete text of the ITS specification. But
there are some things that could be improved. First, the provided
RELAX NG schema uses bad naming conventions. All named patterns
should use a common prefix such as "its." so they can be easily
and safely included in other schemas.
In addition, I don't think that attributes used inside the documentRules
element and its descendants should be prefixed. It goes against the common
practice that global (namespace-qualified) attributes are used only on
foreign elements. So in my opinion the previous example should be
rewritten as:
<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="docbook-its.rnc" type="compact"?>
<article xmlns="http://docbook.org/ns/docbook"
         xmlns:its="http://www.w3.org/2005/11/its"
         version="5.0">
  <info>
    <title>Sample article</title>
    <its:documentRules>
      <its:ns prefix="db" uri="http://docbook.org/ns/docbook"/>
      <its:documentRule translate="no" translateSelector="//db:para/@*"/>
      <its:documentRule translate="yes" translateSelector="//db:para"/>
    </its:documentRules>
  </info>
  <para its:translate="no">Nontranslateable content</para>
  <sect1 status="draft">
    <title>Draft section</title>
    <para>Another content</para>
  </sect1>
</article>
I hope that the ITS group will be more responsive to my comments than the XHTML WG. I sent several comments about XHTML 2.0 in the past and have had no response after more than half a year. This is not how standards should be created.
[UPDATE] It seems that the ITS guys are doing their work well. Their response was fast and well grounded.