A computer system
uses graphical means to display text: letters and other symbols are drawn
using appropriate fonts. For English and most European languages, standards
that relate character codes to specific displayed shapes have been in use
for many years. ASCII is the standard code for the Roman alphabet, and
eight bit extensions of it are used for other languages whose letters carry
diacritical marks. ASCII proper is a seven bit code with 128 characters;
the eight bit extensions permit 256 different characters to be represented.
In multilingual documents,
one will find more than 256 symbols, and hence a single eight bit code cannot
be applied directly. Often one uses the principle of code pages, where the
interpretation of an eight bit code changes to reflect the display of text
in a different language. One selects a specific code page at a
time, and it is used to map the codes to the displayed shapes. Code page
switching is meaningful for languages/scripts which employ a limited set
of characters in displayed text. Thus one associates a character set with
the text and the display is effected through the use of fonts which conform
to the character set.
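
The code page principle can be illustrated with a short Python sketch. It
decodes one and the same byte value under three Windows code pages that ship
with Python's standard codecs; the helper name decode_with is made up for
this illustration and is not part of any standard.

```python
# One byte value, interpreted under three different code pages.
raw = bytes([0xE4])

def decode_with(code_page, data):
    """Interpret the given bytes according to the named code page."""
    return data.decode(code_page)

for code_page in ("cp1252", "cp1251", "cp1253"):
    # cp1252 is Western European, cp1251 Cyrillic, cp1253 Greek.
    print(code_page, decode_with(code_page, raw))
```

The byte itself never changes; only the selected code page decides whether
it is shown as a Latin, a Cyrillic or a Greek letter.
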
In respect of Indian languages,
it has not been possible to define any specific character sets since the
writing systems employ literally thousands of different shapes. What has
been generally accomplished in the past is that a minimal set of shapes
(often restricted to about 180-200) is defined and the syllables to be
displayed composed from these shapes. This way, eight bit fonts have been
used to display text in most Indian languages/scripts. Unfortunately, the
real problem in dealing with such text on a computer is identifying
the linguistic content from the displayed shapes, because text processing in
Indian languages has to be attempted at the level of an akshara (Samyuktakshara),
which is essentially a syllable.
Codes for the character sets
from different languages may be pooled together to form a much larger set
than 256, requiring the use of 16 bit (or wider) codes and corresponding fonts. This
is the idea behind Unicode, which has become a new standard for text representation.
Yet the assignment of codes for languages/scripts which employ a syllabic
writing system continues to pose problems for Unicode, since the code space
given to such languages is restricted to the basic consonants, vowels
and medial vowel forms. One is forced to code syllables in terms of variable
length representations. This poses fairly serious problems for rendering
text as well as linguistic processing. It is very difficult to map a variable
length code to a displayed shape.
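
The variable length representation is easy to observe. The following Python
sketch counts the Unicode code points needed for three Devanagari syllables;
the sample syllables and the romanised labels are chosen here purely for
illustration.

```python
# Three Devanagari syllables, written as explicit code points.
syllables = {
    "ka":   "\u0915",                               # single consonant
    "kSha": "\u0915\u094d\u0937",                   # ka + virama + ssa
    "strI": "\u0938\u094d\u0924\u094d\u0930\u0940", # sa + virama + ta + virama + ra + vowel sign ii
}

def code_points(text):
    """List the code points of a string in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]

for label, syllable in syllables.items():
    # Each syllable is displayed as one unit but stored as 1, 3 or 6 codes.
    print(label, len(syllable), code_points(syllable))
```

Each of the three is read and displayed as a single syllable, yet the stored
length varies from one code to six.
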
The Indian language computing
scene continues to pose challenges for software developers since no standards
exist for text rendering. The problem of text rendering is discussed in
detail elsewhere in these pages.
As of this writing (June
2006), computing with Indian languages remains complex in the
absence of any agreed standard that works well for all the Indian languages.
As a result, it is difficult to disseminate information in Indian languages
through the web, where multitudes of systems and software have to agree
to deal with syllables rather than letters.
Computing with Indian languages
will remain complex unless one seriously considers text processing at the
level of a syllable with fixed length codes.
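
What syllable level processing with fixed length codes might look like can
be sketched in Python. This is only an illustration under strong
assumptions, not an agreed scheme: the regular expression is a deliberately
naive akshara rule for Devanagari (it ignores the nukta, a trailing virama
and other special cases), and the names segment, assign_codes and the code
table built on the fly are hypothetical.

```python
import re
import struct

CONSONANT = "[\u0915-\u0939]"   # ka .. ha
VOWEL = "[\u0905-\u0914]"       # independent vowels
MATRA = "[\u093e-\u094c]"       # dependent (medial) vowel signs
MODIFIER = "[\u0901-\u0903]"    # candrabindu, anusvara, visarga
VIRAMA = "\u094d"

# Naive rule: (consonant + virama)* consonant [matra] [modifier],
# or an independent vowel followed by an optional modifier.
AKSHARA = re.compile(
    f"(?:{CONSONANT}{VIRAMA})*{CONSONANT}{MATRA}?{MODIFIER}?"
    f"|{VOWEL}{MODIFIER}?"
)

def segment(text):
    """Split text into aksharas using the naive rule above."""
    return AKSHARA.findall(text)

def assign_codes(aksharas, table):
    """Give every distinct akshara one integer code (hypothetical table)."""
    codes = []
    for akshara in aksharas:
        if akshara not in table:
            table[akshara] = len(table)
        codes.append(table[akshara])
    return codes

word = "\u0938\u0902\u0938\u094d\u0915\u0943\u0924"  # 'Sanskrit' in Devanagari
aksharas = segment(word)
codes = assign_codes(aksharas, {})
# Pack every code as one unsigned 16 bit value: two bytes per syllable.
packed = struct.pack(">%dH" % len(codes), *codes)
print(len(word), len(aksharas), len(packed))
```

Here seven code points reduce to three aksharas, each stored in exactly two
bytes, so that the n-th syllable of a text can be located without scanning.
A real scheme would need a fixed, shared table rather than one built per
text.
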