Mi blog lah! Το ιστολόγιό μου

18 May 2009

Generating multilingual PDF files out of GNOME documentation

Robert describes how to generate PDF files from GNOME documentation source files.

We describe here how to manually generate PDF files from translated documentation.

The relevant gnome-doc-utils files for documentation generation are at http://git.gnome.org/cgit/gnome-doc-utils/tree/tools. As far as I understand from reading the makefile, there is no support yet for building PDFs out of localised documentation.

Let's assume we want to generate PDF documentation for the GNOME 2 User Guide, for the Greek language.

1. We clone the relevant repository

$ git clone git://git.gnome.org/gnome-user-docs

2. Then, we enter the User Guide directory,

cd gnome-user-docs/gnome2-user-guide/

and generate translated copies of the XML files found in C/, applying the Greek localisation (found in el/),

\ls C/*.xml | perl -n -e 'chop; $a=$_; print "xml2po -p el/el.po $a > el/`basename $a`\n"' | sh
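The one-liner above builds one xml2po command per file and pipes them to sh. As a rough equivalent, here is a plain shell loop. The function name translate_docs is made up for this sketch, and it assumes you run it from the document directory (the one that contains C/ and el/).

```shell
# Hypothetical helper: translate every C/*.xml with the PO file of one language.
translate_docs() {
    lang=$1
    for src in C/*.xml; do
        # If C/ has no .xml files the pattern stays literal, so skip it.
        [ -e "$src" ] || continue
        xml2po -p "$lang/$lang.po" "$src" > "$lang/$(basename "$src")"
    done
}

# Usage, from gnome-user-docs/gnome2-user-guide/:
#   translate_docs el
```

Passing the language code as an argument makes it easy to repeat the step for another translation, for example translate_docs ru.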

3. Let's see what we created,

$ ls el/
el.po    glossary.xml  goscustdesk.xml     
gosfeedback.xml  gosoverview.xml  gosstartsession.xml 
legal.xml  figures  gosbasic.xml  goseditmainmenu.xml 
gosnautilus.xml  gospanel.xml     gostools.xml        
user-guide.xml
$ _

The main file is user-guide.xml, which references the rest of the XML files.

4. One additional step I like to do is to convert all those individual .xml files into a single big XML file just before performing the conversion to PDF. This helps to catch any markup mistakes that may have been introduced during the translation.

cd el/
xmllint --noent user-guide.xml --output documentation-user-guide.xml

This step does not have the desired effect with the XML in the user guide, because the include files are referenced with “<include xmlns="http://www.w3.org/2001/XInclude" href="gosbasic.xml"/>” rather than the “&gosbasic;” entity style that "xmllint --noent" appears to like. Other GNOME documentation uses the latter style. Lazyweb, any tips so that xmllint can create a single big fat XML file?
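One possible answer to the question above: xmllint has an --xinclude option that processes XInclude elements, so adding it should pull the included files into the output. This is a sketch rather than something tested against the user guide, and the function name merge_includes is made up for illustration.

```shell
# Hypothetical helper: resolve XInclude elements (and entities) into one file.
merge_includes() {
    xmllint --xinclude --noent --output "$2" "$1"
}

# Usage, from inside el/:
#   merge_includes user-guide.xml documentation-user-guide.xml
```

Keeping --noent means documents that use the entity style are still expanded by the same command.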

5. You may want to manually populate the figures/ directory. That is, copy any C/figures/* files that are not present in your LL/figures/ directory (LL is the language code, here el).

6. In order to create PDF files for text that does not fit in ISO-8859-1, we need to use the xetex backend,

dblatex --backend=xetex --verbose documentation-user-guide.xml

If all goes well, you are greeted with a documentation-user-guide.pdf document.
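The last two steps (filling in missing figures and building with the xetex backend) can be sketched together as a small shell function. The name build_translated_pdf and its argument order are made up for this example. It only copies figures that the translation does not already provide, so localised screenshots are left alone.

```shell
# Hypothetical helper: copy figures missing from the translation, then build the PDF.
build_translated_pdf() {
    src_figs=$1
    dst_figs=$2
    doc=$3
    mkdir -p "$dst_figs"
    for fig in "$src_figs"/*; do
        [ -f "$fig" ] || continue
        name=$(basename "$fig")
        # Keep any figure that has already been localised.
        [ -e "$dst_figs/$name" ] || cp "$fig" "$dst_figs/$name"
    done
    dblatex --backend=xetex --verbose "$doc"
}

# Usage, from inside the language directory (for example el/):
#   build_translated_pdf ../C/figures figures documentation-user-guide.xml
```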

Two things may go wrong here: either there is an invalid construct in the XML file, or you have stumbled on a XeTeX bug for your language.

For the Greek language there is a known bug and a workaround for XeTeX.

Here is the GNOME2 User Guide (PDF) for

  1. Greek (the hyphenation is a bit messy due to the workaround), and
  2. Russian.

I also tried the user-guide with

Chinese: Fails to compile, encoding problem or most probably limitation in XeTeX.

Thai: Compiles just fine but no Thai font is available. How do you add fonts to XeTeX?

Punjabi: Compiles just fine but no Hindi font is available. How do you add fonts to XeTeX?
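On the font question: on Linux, XeTeX finds fonts through fontconfig, so a first check is whether fontconfig knows of any font that covers the language. The sketch below wraps fc-list for that. The function name fonts_for_lang is made up, and the language codes (th for Thai, pa for Punjabi) are assumed to be the usual ISO 639-1 codes. If the list comes back empty, installing a font for that script is the likely next step. Which font dblatex ends up selecting is a separate matter that this does not address.

```shell
# Hypothetical helper: list font families that fontconfig reports for a language.
fonts_for_lang() {
    fc-list ":lang=$1" family | sort -u
}

# Usage:
#   fonts_for_lang th    # Thai
#   fonts_for_lang pa    # Punjabi
```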

22 November 2008

Help make «DocBook XML to PDF» work for Greek

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
<article lang="en">
<section><title>Title</title>
<para>ëãáâṩẫëĝéõōåőȩçą</para>
<para>ЁЂыњѨѬѺѸѶѦщЖЊЌЍШЩзф</para>
<para>ᾶᾳὰέᾁᾂδϕϟϸϡϸϸαϷϕϲδϕϛ€ϕ©ϖϐͻ©ϖϐ</para>
</section>
</article>

I would appreciate help in solving this issue.

The above document (mytestfile.xml) is a DocBook XML document with text in several scripts (Latin, Cyrillic and Greek). Until recently, such a document was difficult to convert to PDF.

Now, one can run

dblatex --backend=xetex --verbose mytestfile.xml

(this requires installing the dblatex package and its dependencies) and it creates mytestfile.pdf. If you have a fresh installation of Ubuntu 8.10 and you go through the process of installing these packages, please make a list of them, so that it can serve as advice for new users.
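As a starting point for such a list, one can at least check which commands are missing before installing anything. This sketch assumes the two commands that matter are dblatex and xelatex (the XeTeX LaTeX binary that the xetex backend is expected to call). It does not map commands to Ubuntu package names, and missing_tools is a made-up name.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage:
#   missing_tools dblatex xelatex
```

An empty result means every listed command was found.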

Generated PDF document with lang=en

Since we use XeTeX as a backend, we can work with Unicode text directly, which is the proper thing to do. Above you can see that all characters are shown (except a few obscure ones that are not found in DejaVu Sans and are shown as boxes). You can see Latin (+Extended), Cyrillic (+Extended), Greek (+Extended) in the same document.

Generated PDF document with lang=el

The issue arises when we change the lang attribute in the document above from en to el. Here you see Τιτλε, which is in fact Title with each character replaced by its Greek equivalent. This is a sign of a non-Unicode, 8-bit encoding conversion issue. In addition, only some of the remaining characters are shown, and apparently a strange conversion took place.
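To reproduce the two cases side by side, the Greek variant can be derived from the English file by rewriting only the lang attribute of the root element. A minimal sketch, assuming the file names mytestfile-en.xml and mytestfile-el.xml and a root element written exactly as <article lang="...">. The function name set_article_lang is made up.

```shell
# Hypothetical helper: print FILE with the lang attribute of <article> set to LANG.
set_article_lang() {
    sed "s/<article lang=\"[^\"]*\">/<article lang=\"$2\">/" "$1"
}

# Usage:
#   set_article_lang mytestfile-en.xml el > mytestfile-el.xml
#   dblatex --backend=xetex --verbose mytestfile-el.xml
```

Since nothing else in the file changes, any difference between the two PDFs comes from the lang setting alone.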

What we need to do is figure out how to fix XeTeX when lang=el. There is some work to get Greek XeTeX support upstream, and there are instructions on how to add local Greek XeTeX support in your distribution.

What we need is instructions on how to fix the Greek XeTeX support in Ubuntu 8.10, and test that dblatex can generate documents correctly when lang=el.

For your testing, here are the files mytestfile-en.pdf, mytestfile-el.pdf, mytestfile-en.xml, mytestfile-el.xml.