Dan Scott - Conversion from HTML to DocBook v4.1.2 (XML)
This is a detailed guide on how to install and use Indic scripts (Devanagari etc.) using UTF-8 encoding under GNU/Linux. This HOWTO is a work in progress. More sections on fonts and related topics will be added in due course. Special thanks to Dan Scott for the conversion from HTML to DocBook v4.1.2 (XML). Any feedback, suggestions, pointers, gifts, CDs, BMWs will be gladly accepted. All flames will be redirected to /mnt/praises_for_thee/ for future reference. Be afraid.
This HOWTO has been written to help you set up your Linux box to use UTF-8 encoding with various Indic scripts. To use these scripts, you will have to install the IndiX system, developed by NCST, Mumbai, on your machine. I have tested the IndiX system on Exodus GNU/Linux, RedHat Linux, and Mandrake Linux. If you have tested this system on a machine running Debian, please let me know and I will include that in this HOWTO. I want to thank Mr. Keyur Shroff from NCST, Mumbai for allowing me to modify and redistribute his Devanagri-HOWTO.
Please note that Exodus GNU/Linux, developed by the good guys at Centurion Linux, India will ship with the IndiX system installed, thanks to the Transfer of Technology deal signed by NCST, Mumbai and Centurion Linux Pvt. Ltd.
Almost all of the leading GNU/Linux distributions available today have been localized in various international languages like French, German, Spanish, Chinese, Arabic, etc. This HOWTO documents the steps involved in localizing your GNU/Linux distribution to the Indic scripts of your choice. To begin with, you must be aware of the complexity involved in localizing any of the Indian languages. Text input in any Indian language differs from that of English. Perhaps the most significant difference is that in English, each keystroke maps directly onto a letter, and each letter has a unique code. In Indian languages, on the other hand, the equivalent unit of writing is the 'syllable', which is composed of one or more characters entered through the keyboard.
The syllable is composed of vowels, consonants, modifiers and other special graphics signs. These are encoded, just as roman letters are. The user types in a sequence of vowels, consonants, modifiers and the graphics signs. The machine then composes these syllables at run time based on language dependent rules. Every syllable is thus represented in the machine as a unique sequence of vowels, consonants and modifiers. In a text sequence, these characters are stored in logical (phonetic) order.
Indic characters can combine or change shape depending on their context. A character's appearance is affected by its ordering with respect to other characters, the font used to render the character, and the application or system environment. These variables can cause the appearance of Devanagari characters to differ from their nominal glyphs (used in the code charts). Additionally, some characters cause a change in the order of the displayed glyphs. This reordering is not commonly seen in non-Indic scripts and occurs independently of any bi-directional character reordering that might be required.
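To make the difference between logical (stored) order and visual order concrete, here is a small Python sketch. It is only an illustration and has nothing to do with how IndiX itself is implemented; the helper name `describe` and the two sample syllables are assumptions chosen for this example.

```python
import unicodedata


def describe(syllable: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) pairs in stored (logical) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in syllable]


# The syllable "ki": the consonant KA is stored first and the vowel sign I
# second, although a renderer draws the vowel sign to the left of the consonant.
ki = "\u0915\u093F"

# The conjunct "ksha": three stored characters (KA, VIRAMA, SSA) that a font
# typically draws as a single combined shape.
ksha = "\u0915\u094D\u0937"

for label, text in (("ki", ki), ("ksha", ksha)):
    print(label)
    for code, name in describe(text):
        print("   ", code, name)
```

The stored sequence follows pronunciation; producing the correct on-screen order and shape is left to the rendering layer and the font.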
Each syllable has a unique visual representation. However, there are too many syllables to design glyphs for each one individually. So a font normally contains certain component glyphs from which a syllable is composed at run time. The onscreen representation of a syllable is then a composition of glyphs from the Indian language font. There is no direct mapping of glyph codes to the consonant, vowel or modifier codes. However, for every syllable (a sequence of consonants, vowels and modifiers) there is a corresponding sequence of glyphs. This constitutes a many-to-many mapping from keystrokes to glyphs as opposed to a simplistic one-to-one mapping in roman scripts.
The IndiX system developed by NCST, Mumbai enables most applications running under the X Window System (irrespective of the toolkit used) to render Indic characters according to the Unicode standard. IndiX provides support for OpenType fonts and Unicode encoding at the X Window System level. This enables most existing applications to handle Indic scripts without any modification or recompilation.
Once you have installed the IndiX system, following all the steps mentioned in this HOWTO, you will be able to fly across seven seas and slap that annoying sailor who keeps goin' hic' hic'... Okay, on a more serious note, you will be able to enjoy your Linux experience in Devanagari and other Indic scripts of your choice.
You can obtain the IndiX system from the NCST, Mumbai site at http://rohini.ncst.ernet.in/indix/. The system is available in source as well as binary form. This HOWTO covers the installation of the IndiX system using the binary files available for download. At a later stage, I plan to cover the source installation of IndiX, too. You need to download the following files in order to install IndiX successfully on your machine:
NCST has written Simpm (Simple Package Manager), which takes care of the entire installation process on your system. Simpm carries out the following steps for a binary distribution of the IndiX system:
Running simpm with no arguments displays its usage.
To install the IndiX system, all you have to do is (pray and do your favourite tribal dance) type in the following commands:
Congratulations, o' most precious one, on having installed the IndiX system on your machine. The remainder of this HOWTO will focus on setting up your Linux environment to support Indic fonts and scripts in X.
Devanagari characters do not display properly in a Linux console. However, NCST has developed ncst-term (a terminal emulator for the X Window System), which converts keystrokes to UTF-8 before sending them to the application running in it, and displays the Unicode characters that the application outputs as a UTF-8 byte sequence.
You need to make some changes to your XF86Config-4 file (which usually resides in the /etc/X11/ directory). A sample config file, XF86Config-4.indix, is installed along with the IndiX system. This file can be found in the /etc/X11/ directory.
OpenType is the most suitable font format for rendering any Indic script properly. The IndiX system ships with one OpenType font for Hindi, called "raghu". Anyone can use and distribute this font free of cost. You can find this font in the /usr/X11R6/lib/X11/fonts/TrueType/ directory.
Installing the Indic Fonts:
In order to install the Indic fonts, you must log in as root. The X Font Server (xfs) is known to have some problems with the IndiX system, so remove it from the FontPath of the X Server. To do this, edit your XF86Config-4 file (usually in /etc/X11/), comment out the xfs FontPath line in the Files section, and add /usr/X11R6/lib/X11/fonts/TrueType/ to the FontPath.
After that, the FontPath should look something similar to this:
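The edit described above can be sketched in Python. This is a hedged illustration, not a tool shipped with IndiX: the function name `drop_xfs_font_path` is hypothetical, and the sample assumes xfs is referenced by the common `unix/:7100` entry, which may be written differently on your distribution. Try it on a copy of the file first.

```python
def drop_xfs_font_path(config_text: str,
                       font_dir: str = "/usr/X11R6/lib/X11/fonts/TrueType/") -> str:
    """Comment out FontPath lines that point at xfs and add font_dir instead."""
    out = []
    for line in config_text.splitlines():
        stripped = line.strip()
        # Assumption: an xfs entry looks like  FontPath "unix/:7100"
        if stripped.startswith("FontPath") and "unix/:" in stripped:
            indent = line[: len(line) - len(line.lstrip())]
            out.append(f"{indent}# {stripped}")             # disable the xfs entry
            out.append(f'{indent}FontPath "{font_dir}"')    # add the TrueType directory
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# A made-up Files section, used only to demonstrate the function.
sample = '''Section "Files"
    RgbPath  "/usr/X11R6/lib/X11/rgb"
    FontPath "unix/:7100"
EndSection
'''

print(drop_xfs_font_path(sample))
```

The function only transforms text; writing the result back to /etc/X11/XF86Config-4 (as root, after taking a backup) is left to you.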
The IndiX system comes with a keyboard map file for xmodmap. You can use the xmodmap utility to map a Devanagari keyboard. On most distributions, when you start X, the X server looks for an Xmodmap file in the /etc/X11/ directory. If that file does not exist, the server looks for a .Xmodmap file in your $HOME. Simply placing the .Xmodmap file in your $HOME is enough: the X server will load it when it starts. You can also load .Xmodmap from the command line:
You can now use any Unicode characters in file names. No kernel or file utilities need modification. This is because a file name in the kernel can be any byte sequence that does not contain a null byte or a '/', which is used to delimit subdirectories. When text is encoded using UTF-8, non-ASCII characters are never encoded using null bytes or slashes. All that happens is that file and directory names occupy more bytes than they contain characters. For example, a filename consisting of five Greek characters will appear to the kernel as a 10-byte filename. The kernel does not know (and does not need to know) that these bytes are displayed as Greek.
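The byte counts in the paragraph above are easy to check. The following Python sketch is an illustration only; the helper `kernel_safe` is a hypothetical name, and the five Greek letters are an arbitrary sample.

```python
def kernel_safe(filename: str) -> bool:
    """True if the UTF-8 bytes of filename contain neither a null byte nor '/'."""
    raw = filename.encode("utf-8")
    return 0 not in raw and ord("/") not in raw


# Five Greek letters: alpha, beta, gamma, delta, epsilon.
greek = "\u03b1\u03b2\u03b3\u03b4\u03b5"
encoded = greek.encode("utf-8")

print(len(greek), "characters ->", len(encoded), "bytes")
print("safe as a file name:", kernel_safe(greek))
```

Each of these letters takes two bytes in UTF-8, which is where the 10-byte figure comes from. Devanagari characters take three bytes each.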
This is the general theory, so long as your files reside on Linux. On filesystems which are used from other operating systems, you have mount options to control conversion of filenames to or from UTF-8:
The other filesystems (nfs, smbfs, ncpfs, hpfs, etc.) don't convert filenames; therefore they support Unicode file names in UTF-8 encoding only if the other operating system supports them. Please note that to enable a mount option for all future remounts, you add it to the fourth column of the corresponding /etc/fstab line.
You should have the following environment variables set, containing locale names:
To tell your system and all applications that you are using UTF-8, add a codeset suffix of UTF-8 to your locale names. For example, to run an application in a UTF-8 Hindi locale from the bash shell, you can set the relevant environment variable for that application alone on its command line.
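As a rough sketch of the same idea in Python: the helper below appends the UTF-8 codeset suffix to a locale name and builds an environment that can be handed to a single application. The function names are hypothetical, and `hi_IN` is assumed to be the Hindi locale installed on your system.

```python
import os


def utf8_locale(name: str) -> str:
    """Return the locale name with its codeset replaced by (or set to) UTF-8."""
    base, sep, modifier = name.partition("@")   # keep a modifier such as @euro
    lang = base.split(".", 1)[0]                # drop any existing codeset
    return f"{lang}.UTF-8{sep}{modifier}"


def environment_for(locale_name: str, base_env=None) -> dict:
    """Copy the environment and point LANG at the UTF-8 variant of the locale."""
    env = dict(os.environ if base_env is None else base_env)
    env["LANG"] = utf8_locale(locale_name)
    return env


print(utf8_locale("hi_IN"))
# To start one program in that locale without touching your own shell:
#   import subprocess
#   subprocess.run(["yudit"], env=environment_for("hi_IN"))
```

Only the child process sees the changed LANG, which matches the effect of setting the variable on a single shell command line.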
Netscape 6.01 or later can display HTML documents in UTF-8 encoding. All a document needs is the following line between the <head> and </head> tags:
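The usual way to declare the encoding in the head of an HTML document is a `meta http-equiv="Content-Type"` element. The Python sketch below writes a minimal page carrying such a declaration; treat it as an assumption about what the line should look like, not as a quotation from the original listing. The function name `utf8_html` and the sample text are made up for this example.

```python
import html


def utf8_html(title: str, body: str) -> bytes:
    """Build a minimal HTML page that declares UTF-8 and return its UTF-8 bytes."""
    page = (
        "<html>\n"
        "<head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        f"<p>{html.escape(body)}</p>\n"
        "</body>\n"
        "</html>\n"
    )
    # The declared charset must match the encoding actually used for the bytes.
    return page.encode("utf-8")


# "namaste" in Devanagari, written with escapes so the source stays ASCII.
greeting = "\u0928\u092e\u0938\u094d\u0924\u0947"
data = utf8_html("Test", greeting)
print(data.decode("utf-8"))
```

Saving `data` to a file and opening it in the browser is a quick way to check that both the fonts and the encoding setting are working.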
To set up Netscape so that it displays Hindi characters:
Also, ensure that the character coding scheme is set to UTF-8.
Konqueror has good support for Unicode. To set up Konqueror so that it displays Hindi characters:
yudit by Gáspár Sinai (http://czyborra.com/yudit/) is an excellent Unicode text editor for the X Window System. It supports simultaneous processing of many languages, input methods, conversions for local character standards etc. It has facilities for entering text in all languages with only an English keyboard, using keyboard configuration maps. Customization is very easy. Typically you will first want to customize your font. From the font menu, choose "Unicode". Next, you should customize your input method. The input methods "Straight", "Unicode" and "SGML" are most remarkable. For details about the other built-in input methods, look in /usr/local/share/yudit/data/. To make a change the default for later sessions, edit your $HOME/.yuditrc file. The general editor functionality is limited to editing, cut and paste, and search and replace. There is no provision for undo. yudit can display text using a TrueType font, but it doesn't seem to support combining characters.
Vim (as of version 6.0) has good support for UTF-8. When started in a UTF-8 locale, it assumes UTF-8 encoding for the console and the text files being edited. It supports double-wide (CJK) characters as well as combining characters and therefore fits perfectly into the UTF-8 enabled ncst-term.
gedit is an editor developed using the GtkText widget. gedit-0.9.0 does not support FontSet. This means that you can't edit both English and Hindi text simultaneously. But if you choose a proper font, you will be able to use any one language at a time.
With XFree86-4.0.1, xedit is capable of editing UTF-8 files if your locale is set appropriately. Add the line
Mail clients released after January 1, 1999, should be capable of sending and displaying UTF-8 encoded mails, otherwise they are considered deficient. But these mails have to carry the MIME labels:
Simply piping a UTF-8 file into "mail" without caring about the MIME labels will not work. Mail client implementors should take a look at http://www.imc.org/imc-intl/ and http://www.imc.org/mail-i18n.html.
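For readers who want to see what a correctly labelled message looks like, here is a minimal Python sketch using the standard `email` package. It assumes the labels in question are a `Content-Type` header with `charset=UTF-8` and a `Content-Transfer-Encoding` of `8bit`; the addresses and the function name `utf8_mail` are placeholders invented for the example.

```python
from email.message import EmailMessage


def utf8_mail(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Build a plain-text message whose MIME labels declare UTF-8, sent as 8bit."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # set_content() adds the Content-Type and Content-Transfer-Encoding headers.
    msg.set_content(body, charset="utf-8", cte="8bit")
    return msg


# "namaste" in Devanagari, written with escapes so the source stays ASCII.
msg = utf8_mail("sender@example.org", "reader@example.org",
                "UTF-8 test", "\u0928\u092e\u0938\u094d\u0924\u0947")

for header in ("MIME-Version", "Content-Type", "Content-Transfer-Encoding"):
    print(f"{header}: {msg[header]}")
```

A mail client that honours these headers knows how to decode the body; without them, the receiving side has to guess the encoding.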
Now about some of the individual mail clients (or "mail user agents"):
kmail (as of KDE 1.0) does not support UTF-8 mails at all.
Netscape Mail can send and display mails in UTF-8 encoding, but it needs a little bit of manual user intervention. To send a UTF-8 encoded mail:
When you receive a UTF-8 encoded mail, Netscape does not display it in UTF-8 right away, and does not even give a visual clue that the mail was encoded in UTF-8. You have to manually select from the menu-> -> .
For displaying UTF-8 mails, Netscape uses different fonts. You can adjust your font settings in the-> -> dialog by selecting the "Unicode" font category.
exmh 2.1.2 with Tk 8.4a1 can recognize and correctly display UTF-8 mails if you add the following lines to your $HOME/.Xdefaults file.
Please make sure that the font 'xyz' is correctly installed and is in the current FontPath. The Indic fonts usually reside in the /usr/X11R6/lib/X11/fonts/TrueType/ directory. Your FontPath is defined in the /etc/X11/XF86Config-4 file. To learn more about how to specify your FontPath, read the section on X Window System (3.2) in this HOWTO.
You can load an Indic script font by passing a server option on the command line when starting the X Window System, e.g.
This could be because your Hindi locale has not been set up correctly. To change or set the locale, you should set the LANG environment variable. Append the line
This is probably because the X Font Server (xfs) is running and is still in the current FontPath. You can either shut down the X Font Server or remove it from the current FontPath. To shut down xfs, issue the following command after becoming root:
The IndiX system uses an OpenType font to render Indic script characters, as it is the most suitable font format for Indic scripts. Other kinds of fonts, for example a TrueType font or a bitmap font, do not carry enough information to render Indic script text properly. It is therefore recommended to use only OpenType fonts for Indic scripts. If you are already using an OpenType font, please update your glibc.
The good guys at Centurion Linux are looking for sponsors who can take care of their hosting needs. If you are interested in helping Centurion Linux out, please contact me on <firstname.lastname@example.org>.
Parts of this HOWTO have been taken from The Unicode HOWTO by Bruno Haible and The Devanagri HOWTO by Keyur Shroff.
I would also like to take this opportunity to thank my papa, mummy and my brothers Manvinder and Kulvinder for their unconditional love and support, without whom I could never have achieved anything in life. Forever, I love you. Loshaca :)
To Girija, my girlfriend: :) Thanks for everything.
I am very grateful to Keyur Shroff for allowing me to modify and redistribute his Devanagri HOWTO. Special thanks go out to him for his guidance, help, and support.
Thanks to Rohan D'Sa and Manvinder Bali of Centurion Linux for having helped me with various UTF-8 and Indic scripts issues. Also, thanks for representing Centurion Linux at the Business Technology meet organised by Ministry of Information Technology, New Delhi.
Once again, special thanks to Dan Scott for converting the HOWTO to DocBook XML format. Thanks Dan :)
The purpose of this License is to make a manual, textbook, or other written document "free" in the sense of freedom: to assure everyone the effective freedom to copy and redistribute it, with or without modifying it, either commercially or noncommercially. Secondarily, this License preserves for the author and publisher a way to get credit for their work, while not being considered responsible for modifications made by others.
This License is a kind of "copyleft", which means that derivative works of the document must themselves be free in the same sense. It complements the GNU General Public License, which is a copyleft license designed for free software.
We have designed this License in order to use it for manuals for free software, because free software needs free documentation: a free program should come with manuals providing the same freedoms that the software does. But this License is not limited to software manuals; it can be used for any textual work, regardless of subject matter or whether it is published as a printed book. We recommend this License principally for works whose purpose is instruction or reference.
This License applies to any manual or other work that contains a notice placed by the copyright holder saying it can be distributed under the terms of this License. The "Document", below, refers to any such manual or work. Any member of the public is a licensee, and is addressed as "you".
A "Modified Version" of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language.
A "Secondary Section" is a named appendix or a front-matter section of the Document that deals exclusively with the relationship of the publishers or authors of the Document to the Document's overall subject (or to related matters) and contains nothing that could fall directly within that overall subject. (For example, if the Document is in part a textbook of mathematics, a Secondary Section may not explain any mathematics.) The relationship could be a matter of historical connection with the subject or with related matters, or of legal, commercial, philosophical, ethical or political position regarding them.
The "Cover Texts" are certain short passages of text that are listed, as Front-Cover Texts or Back-Cover Texts, in the notice that says that the Document is released under this License.
A "Transparent" copy of the Document means a machine-readable copy, represented in a format whose specification is available to the general public, whose contents can be viewed and edited directly and straightforwardly with generic text editors or (for images composed of pixels) generic paint programs or (for drawings) some widely available drawing editor, and that is suitable for input to text formatters or for automatic translation to a variety of formats suitable for input to text formatters. A copy made in an otherwise Transparent file format whose markup has been designed to thwart or discourage subsequent modification by readers is not Transparent. A copy that is not "Transparent" is called "Opaque".
Examples of suitable formats for Transparent copies include plain ASCII without markup, Texinfo input format, LaTeX input format, SGML or XML using a publicly available DTD, and standard-conforming simple HTML designed for human modification. Opaque formats include PostScript, PDF, proprietary formats that can be read and edited only by proprietary word processors, SGML or XML for which the DTD and/or processing tools are not generally available, and the machine-generated HTML produced by some word processors for output purposes only.
The "Title Page" means, for a printed book, the title page itself, plus such following pages as are needed to hold, legibly, the material this License requires to appear in the title page. For works in formats which do not have any title page as such, "Title Page" means the text near the most prominent appearance of the work's title, preceding the beginning of the body of the text.
You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies, and that you add no other conditions whatsoever to those of this License. You may not use technical measures to obstruct or control the reading or further copying of the copies you make or distribute. However, you may accept compensation in exchange for copies. If you distribute a large enough number of copies you must also follow the conditions in section 3.
You may also lend copies, under the same conditions stated above, and you may publicly display copies.
If you publish printed copies of the Document numbering more than 100, and the Document's license notice requires Cover Texts, you must enclose the copies in covers that carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on the back cover. Both covers must also clearly and legibly identify you as the publisher of these copies. The front cover must present the full title with all words of the title equally prominent and visible. You may add other material on the covers in addition. Copying with changes limited to the covers, as long as they preserve the title of the Document and satisfy these conditions, can be treated as verbatim copying in other respects.
If the required texts for either cover are too voluminous to fit legibly, you should put the first ones listed (as many as fit reasonably) on the actual cover, and continue the rest onto adjacent pages.
If you publish or distribute Opaque copies of the Document numbering more than 100, you must either include a machine-readable Transparent copy along with each Opaque copy, or state in or with each Opaque copy a publicly-accessible computer-network location containing a complete Transparent copy of the Document, free of added material, which the general network-using public has access to download anonymously at no charge using public-standard network protocols. If you use the latter option, you must take reasonably prudent steps, when you begin distribution of Opaque copies in quantity, to ensure that this Transparent copy will remain thus accessible at the stated location until at least one year after the last time you distribute an Opaque copy (directly or through your agents or retailers) of that edition to the public.
It is requested, but not required, that you contact the authors of the Document well before redistributing any large number of copies, to give them a chance to provide you with an updated version of the Document.
You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version:
If the Modified Version includes new front-matter sections or appendices that qualify as Secondary Sections and contain no material copied from the Document, you may at your option designate some or all of these sections as invariant. To do this, add their titles to the list of Invariant Sections in the Modified Version's license notice. These titles must be distinct from any other section titles.
You may add a section entitled "Endorsements", provided it contains nothing but endorsements of your Modified Version by various parties--for example, statements of peer review or that the text has been approved by an organization as the authoritative definition of a standard.
You may add a passage of up to five words as a Front-Cover Text, and a passage of up to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified Version. Only one passage of Front-Cover Text and one of Back-Cover Text may be added by (or through arrangements made by) any one entity. If the Document already includes a cover text for the same cover, previously added by you or by arrangement made by the same entity you are acting on behalf of, you may not add another; but you may replace the old one, on explicit permission from the previous publisher that added the old one.
You may combine the Document with other documents released under this License, under the terms defined in section 4 above for modified versions, provided that you include in the combination all of the Invariant Sections of all of the original documents, unmodified, and list them all as Invariant Sections of your combined work in its license notice.
The combined work need only contain one copy of this License, and multiple identical Invariant Sections may be replaced with a single copy. If there are multiple Invariant Sections with the same name but different contents, make the title of each such section unique by adding at the end of it, in parentheses, the name of the original author or publisher of that section if known, or else a unique number. Make the same adjustment to the section titles in the list of Invariant Sections in the license notice of the combined work.
In the combination, you must combine any sections entitled "History" in the various original documents, forming one section entitled "History"; likewise combine any sections entitled "Acknowledgements", and any sections entitled "Dedications". You must delete all sections entitled "Endorsements."
You may make a collection consisting of the Document and other documents released under this License, and replace the individual copies of this License in the various documents with a single copy that is included in the collection, provided that you follow the rules of this License for verbatim copying of each of the documents in all other respects.
You may extract a single document from such a collection, and distribute it individually under this License, provided you insert a copy of this License into the extracted document, and follow this License in all other respects regarding verbatim copying of that document.
A compilation of the Document or its derivatives with other separate and independent documents or works, in or on a volume of a storage or distribution medium, does not as a whole count as a Modified Version of the Document, provided no compilation copyright is claimed for the compilation. Such a compilation is called an "aggregate", and this License does not apply to the other self-contained works thus compiled with the Document, on account of their being thus compiled, if they are not themselves derivative works of the Document. If the Cover Text requirement of section 3 is applicable to these copies of the Document, then if the Document is less than one quarter of the entire aggregate, the Document's Cover Texts may be placed on covers that surround only the Document within the aggregate. Otherwise they must appear on covers around the whole aggregate.
Translation is considered a kind of modification, so you may distribute translations of the Document under the terms of section 4. Replacing Invariant Sections with translations requires special permission from their copyright holders, but you may include translations of some or all Invariant Sections in addition to the original versions of these Invariant Sections. You may include a translation of this License provided that you also include the original English version of this License. In case of a disagreement between the translation and the original English version of this License, the original English version will prevail.
You may not copy, modify, sublicense, or distribute the Document except as expressly provided for under this License. Any other attempt to copy, modify, sublicense or distribute the Document is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.
The Free Software Foundation may publish new, revised versions of the GNU Free Documentation License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. See http://www.gnu.org/copyleft/.
Each version of the License is given a distinguishing version number. If the Document specifies that a particular numbered version of this License "or any later version" applies to it, you have the option of following the terms and conditions either of that specified version or of any later version that has been published (not as a draft) by the Free Software Foundation. If the Document does not specify a version number of this License, you may choose any version ever published (not as a draft) by the Free Software Foundation.
To use this License in a document you have written, include a copy of the License in the document and put the following copyright and license notices just after the title page:
If you have no Invariant Sections, write "with no Invariant Sections" instead of saying which ones are invariant. If you have no Front-Cover Texts, write "no Front-Cover Texts" instead of "Front-Cover Texts being LIST"; likewise for Back-Cover Texts.
If your document contains nontrivial examples of program code, we recommend releasing these examples in parallel under your choice of free software license, such as the GNU General Public License, to permit their use in free software.