Existing coding schemes for Indian language text
The internal representation of text in Indian languages may be viewed as the problem of assigning codes to the aksharas of the languages. The complexity of the syllabic writing systems in use has made it difficult to standardize internal representations. TeX was an inspiration in the late 1980s, but it was better suited to typesetting than to text processing per se. In the absence of appropriate fonts, interactive applications could not be attempted. When fonts became available, applications simply used the glyph positions as the codes, and the number of glyphs was limited by the eight-bit fonts.
Indian Script Code for Information Interchange (ISCII)
Unicode for Indian Languages

Technically, Unicode can handle many more languages than the number of supported scripts suggests, provided these languages use the same script in their writing systems. By consolidating the complete set of symbols used in the writing systems across a family of languages, one can get a script that caters to all of them. The Latin script, with its supplementary characters and extended symbols, has about 550 different characters, and this is quite adequate to handle almost anything that has appeared in print in the Latin script. Hence, in the geometrical view above, some planes may be larger (wider) than others, and more than one script could have characters from logically similar groups specified in a plane.
From the discussion above, it will be seen that ISCII and Unicode provide multibyte representations for syllables. This is not unlike the case for English and other European languages, where syllables are written using only the basic letters of the alphabet. However, in all the writing systems used in India, each syllable has its own unique, individually identifiable shape, and one has to provide for thousands of shapes when rendering text.
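The multibyte nature of a syllable in Unicode can be seen with a small example. The sketch below is only an illustration: the Devanagari akshara "kshi" is chosen here as an assumed sample and does not come from the original discussion, and `describe` is a hypothetical helper name.

```python
import unicodedata

# Sample akshara "kshi", written with explicit escapes:
# KA + VIRAMA + SSA + VOWEL SIGN I (an illustrative choice).
syllable = "\u0915\u094D\u0937\u093F"

def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

code_points = describe(syllable)              # one entry per code point
utf8_length = len(syllable.encode("utf-8"))   # size in bytes when sent as UTF-8

for code, name in code_points:
    print(code, name)
print("code points:", len(syllable), "UTF-8 bytes:", utf8_length)
```

A single written shape on the screen is thus stored as four code points, each of which takes three bytes in UTF-8.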
Unicode: Related Information

Many questions relating to Unicode for Indic scripts have been answered at the Unicode web site.

The site maintained by Alan Wood provides extensive coverage of Unicode, including Unicode resources for Indian languages.

UTF-8 is the method used for moving Unicode data across systems and for displaying Unicode-encoded documents in web browsers. Markus Kuhn provides an excellent discussion.

While most people say exciting things about Unicode, there are a few who share our concern about its weak points. Here are some observations made by an expert.
Specific technical problems with ISCII and Unicode

It must be observed, in the light of the above discussion, that displaying a Unicode string in an Indian language requires a complex piece of processing software to identify the syllables and fetch the corresponding glyphs from an appropriate font for the script. The multibyte nature of Unicode (for a syllable) makes a table-driven approach to this quite difficult.

Even though it is possible to write modules that go from Unicode to the display of text using some font, one faces a formidable problem with data entry, where the formation of syllables from multiple key sequences is truly overwhelming. With the limited number of keys available on standard keyboards, it is often not possible to accommodate all the symbols one would require to produce meaningful printouts in each script, consistent with quality typesetting systems.

Unicode-based applications employ the concept of "Locales" to permit data entry of multilingual text. Each Locale is associated with its own keyboard mapping, and application software can switch Locales as needed. It will be seen that for Indian scripts, the Locales themselves have limitations, since they do not permit a full complement of letters and special characters to be typed in, much less the standard punctuation that has become part of Indian scripts today.

While it is possible to write special keyboard driver programs which implement a state machine to handle key sequences and produce syllables, the approach is not universal enough to be included in operating systems, certainly not when a single driver should cater to all the Indian scripts. There is no sense in having a Hindi version of an OS with its own data entry convention that differs substantially from that of a Tamil or Telugu version.

Here is a summary of the issues that confront us when dealing with Unicode for Indian scripts.
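As an aside on the syllable identification step described in this section, the following is a minimal sketch of how a Unicode string might be grouped into aksharas before glyphs are looked up. It assumes Devanagari only and a deliberately simplified rule set (a consonant following a virama continues a conjunct; vowel signs, nukta, virama and the candrabindu, anusvara and visarga signs attach to the current unit). The function names are hypothetical, and a real renderer has to handle many more cases.

```python
VIRAMA = "\u094D"

def is_consonant(ch):
    # Basic Devanagari consonants KA..HA (simplifying assumption).
    return "\u0915" <= ch <= "\u0939"

def is_dependent_sign(ch):
    # Candrabindu/anusvara/visarga, nukta, vowel signs and virama.
    return ("\u0901" <= ch <= "\u0903") or ch == "\u093C" or ("\u093E" <= ch <= "\u094D")

def split_aksharas(text):
    """Group code points into akshara-like units with a small state machine."""
    units = []
    current = ""
    for ch in text:
        if is_consonant(ch) and current.endswith(VIRAMA):
            current += ch          # consonant after virama: conjunct continues
        elif is_dependent_sign(ch) and current:
            current += ch          # sign attaches to the unit being built
        else:
            if current:
                units.append(current)
            current = ch           # any other character starts a new unit
    if current:
        units.append(current)
    return units

# "namaste": NA, MA, SA + VIRAMA + TA + VOWEL SIGN E
word = "\u0928\u092E\u0938\u094D\u0924\u0947"
aksharas = split_aksharas(word)
print(len(word), "code points ->", len(aksharas), "aksharas")
```

Even this toy version has to look at neighbouring code points to decide where a unit ends, which is why a simple one-code-to-one-glyph table does not work.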
Details of Unicode for Indian scripts have been published in the standard available from the Unicode Consortium. The Unicode web site does have useful information, but one will have to resort to the printed text to get the real details. These are also available in PDF format from the above web site.
Is Unicode for Indian languages meaningless, then? The answer is certainly no. The main purpose of Unicode is to transport information across computer systems. As of today, Unicode is reasonably adequate for this job, since it does provide for representing text at the syllable level, though not in fixed-size units (bytes). Applications dealing with Indian languages will have to include a special layer which transforms Unicode text into a representation that is more meaningful for linguistic or text processing purposes.

The point to keep in mind is that the seven-bit ASCII-based representation used for most of the world's languages serves both purposes well: not only are text strings transferable across systems, but linguistic processing is also consistent with the seven-bit representation. It so happens that the phonetic nature of our Indian languages has necessitated a different representation for linguistic analysis. For the majority of the world's languages, which use a relatively small set of symbols to represent the letters of their alphabets, 8-bit (or even 7-bit) character codes are adequate to represent the letters.
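The idea of a special layer between Unicode and text processing can be sketched as follows. This is only an illustration under stated assumptions, not a description of any existing scheme: the `SyllableCodec` class, the simplified Devanagari pattern `AKSHARA`, and the choice of codes handed out in order of first appearance are all hypothetical.

```python
import re
from array import array

# Simplified akshara pattern (assumption): optional consonant+virama pairs,
# a consonant, an optional virama or vowel sign, an optional modifier sign.
# Anything else is treated as a unit by itself.
AKSHARA = re.compile(
    "(?:[\u0915-\u0939]\u094D)*[\u0915-\u0939](?:\u094D|[\u093E-\u094C])?[\u0901-\u0903]?|.",
    re.DOTALL,
)

class SyllableCodec:
    """Map each akshara to one fixed-size integer code and back."""

    def __init__(self):
        self.code_of = {}   # akshara -> code
        self.unit_of = []   # code -> akshara

    def encode(self, text):
        # array("H") stores unsigned shorts, so every syllable occupies the
        # same number of bytes; it overflows beyond 65535 distinct units.
        codes = array("H")
        for unit in AKSHARA.findall(text):
            if unit not in self.code_of:
                self.code_of[unit] = len(self.unit_of)
                self.unit_of.append(unit)
            codes.append(self.code_of[unit])
        return codes

    def decode(self, codes):
        return "".join(self.unit_of[c] for c in codes)

codec = SyllableCodec()
text = "\u0928\u092E\u0938\u094D\u0924\u0947 \u0928\u092E"   # "namaste nama"
codes = codec.encode(text)
print(list(codes), codec.decode(codes) == text)
```

With such a layer, operations like counting, comparing or indexing syllables work on one code per syllable, while Unicode remains the format used for exchange between systems.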
Please refer to the FAQ at the Unicode web site, which answers some of the questions raised here. The real issue is whether Unicode is adequate for efficient processing of syllables, so that one may attempt meaningful processing of text in Indian languages consistent with the syllabic writing system.