Electronic Representation of Text
Electronic processing
of text in any language requires that characters (letters of the alphabet
along with special symbols) be represented through unique codes. Usually,
this code will also correspond to the written shape of the letter. A code
is basically a number associated with each letter so that computers
can distinguish between different letters through their codes. The ASCII
code is the standard by which the Roman alphabet is handled.
The code serves the
important purpose of standardizing the way text is handled
on different computer systems. As of today, the ASCII code is probably
the only code interpreted identically on all computer systems. Information
made up of pure ASCII coded text is thus viewable on almost all computers.
Email is an example of an application that works on almost all computers,
since it uses pure ASCII coded text in the messages.
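The idea of a code as a number attached to each letter can be seen in a
minimal Python sketch. The sample word is only an illustration; ord and chr
are standard Python built-ins.

```python
# Each character of a pure ASCII string corresponds to one number below 128.
text = "Email"
codes = [ord(ch) for ch in text]
print(codes)

# The mapping is reversible: the numbers alone are enough to recover the text.
restored = "".join(chr(code) for code in codes)
print(restored)

# Pure ASCII text occupies exactly one byte per character.
encoded = text.encode("ascii")
print(len(encoded), "bytes for", len(text), "characters")
```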
In the early days
of information processing, the Roman character set served as the basis
for interacting with computer systems. European languages which use a slightly
different set of character codes manage quite well with the ASCII approach
by replacing a few of the special characters in ASCII by symbols specific
to each language.
The ASCII code covers
a range of 128 characters, of which 95 (codes 32 to 126) are
printable; code 127 is the DEL control character, so the range 32 to 127
is often loosely counted as 96 displayable codes. ASCII itself is a seven
bit code, but characters are stored in eight bit bytes, and the codes between
128 and 255 are often used for displaying symbols useful for tabular
information, graphics etc. Some European languages such as Greek and
Russian also place the codes for their alphabets in the range 128 to 255,
to allow bilingual information (English as well as the specific language)
to be displayed easily. The International Organization for Standardization
(ISO) has come up with recommendations known as the Latin character sets
(the ISO 8859 series), which encode the alphabets of the different European
languages. For an excellent review of these the viewer is encouraged
to look at Internationalization.
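The use of the range 128 to 255 for a second alphabet can be illustrated
with the ISO 8859 codecs that ship with Python. The sketch below decodes the
same two bytes under three different eight bit character sets; the byte
values are chosen only for illustration.

```python
# One ASCII byte (0x41) followed by one byte from the upper range (0xE1).
raw = bytes([0x41, 0xE1])

# The ASCII half is read the same way everywhere; the upper half depends
# entirely on which eight bit character set is assumed.
decoded = {}
for codec in ("latin_1", "iso8859_7", "iso8859_5"):
    decoded[codec] = raw.decode(codec)
    print(codec, decoded[codec])
```

The first letter is always "A", while the second byte becomes an accented
Latin letter, a Greek letter or a Cyrillic letter. This is why text using
codes above 127 is readable only when sender and receiver agree on the
character set.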
The fundamental
idea in standardizing character codes is to allow data entry to proceed
using the standard QWERTY keyboard. Most word processors and text editors
associate the keys on the keyboard with specific ASCII codes and hence
can support data entry in any language that assigns the displayable ASCII
codes to its alphabet. Display is effected through the use of fonts, where
associated with each ASCII code is the shape of the letter that should
be displayed. Fonts typically deal with specific codes representing different
character sets (known as font encoding). The displayed shape or form of
the character will also differ from font to font though the fonts may be
encoded identically. This allows us to display a given string of text using
shapes suited to a given requirement.
Since there are many
languages and the ASCII encoding supports fewer than a hundred displayable
characters for data entry and display, some special mechanism is needed to
associate the codes with the letters of different languages, if multilingual
information is to be displayed. This mechanism has been provided through
Unicode, which retains ASCII as its first 128 codes and assigns each script
its own range of codes, so that the script associated with a character can
be identified. Unicode caters to
a very large set of characters representing several scripts of the world.
Unicode is yet
to be recognized by many word processing and data entry software running
on different computers of the world (as of today, Jan. 2005). Microsoft
Windows (Windows 2000/XP), Linux and Java support Unicode but the bulk of the
systems continue the plain ASCII approach. Unicode
is an international standard that should necessarily be understood by persons
developing multilingual software. Web browsers provide support for displaying
Unicode text on different computer systems.
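How Unicode ties a character to its script can be seen with Python's
standard unicodedata module. In the sketch below, script_of is a
hypothetical helper that simply takes the first word of the official
character name; that is a convenience assumed for this example, not a
general way of determining the script.

```python
import unicodedata

# The letter "ka" in three Indian scripts, plus a Roman letter for comparison.
samples = ["A", "\u0915", "\u0B95", "\u0C95"]

def script_of(ch):
    """Return the first word of the Unicode character name (illustrative only)."""
    return unicodedata.name(ch).split()[0]

for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

scripts = [script_of(ch) for ch in samples]
print(scripts)
```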
Before we take
up the issue of coding Indian language characters, we make some important
observations.
- Eight bit character codes are entirely adequate for languages whose
  alphabet is a small set and the written text consists of only the
  individual letters themselves and possibly some punctuation.
- Data entry in any language can be effected with ease by encoding the
  letters of the alphabet along the lines of the ASCII code and using
  appropriate fonts with the word processing software. If the codes are
  also assigned in the range 128-255, then data entry is not
  straightforward and will require special input mechanisms.
A document
prepared on the basis of Unicode encodings, will be truly Multilingual.
However, data entry on word processors supporting Unicode (e.g., Microsoft
Word) still remains cumbersome. More about this in the section on
Unicode for Indian Languages.
Is there a character set for Indian languages?
Any
attempt at encoding text in Indian languages has to address this important
question. While it is true that all Indian languages have a phonetic base
built on top of a fixed number of vowels and consonants, the writing
systems permit many different shapes to be generated depending on the syllables
in the text. In a way this may be likened to the addition of ligatures.
The ligature is a special shape that is added to the basic shape of a consonant
when syllables are formed with the consonant through combinations with
other consonants and finally, a vowel. The writing systems for Indian languages
provide for representing thousands of combinations of the basic consonants
and vowels.
The Samyuktaksharas
or the conjunct characters which the writing systems use, represent combinations
of sounds. Linguistically, the Akshara is the basic quantum or measure
used in reckoning the number of sound combinations within a word and poetic
Metre is specified according to the number of such aksharas in each
line of verse. We illustrate this through an example. The verse
shown below is the opening verse of the famous Bhagavadgita.
The aksharas in the
verse are individually identified in the representation and in this
specific Metre, each line of verse contains two groups of eight aksharas.
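Counting aksharas can be sketched in Python. The example below assumes the
text is held as Unicode Devanagari characters and uses a deliberately
simplified rule: a consonant that follows a virama (halant) joins the
current akshara, and vowel signs and modifiers attach to it. The function
name split_aksharas and the rule itself are assumptions for this
illustration; a complete treatment would need to cover more cases (the
nukta, for instance). The sample is the first eight aksharas of the
Bhagavadgita's opening verse, "dharmakshetre kurukshetre".

```python
VIRAMA = "\u094D"

def is_consonant(ch):
    # Devanagari consonants ka..ha
    return "\u0915" <= ch <= "\u0939"

def is_dependent(ch):
    # Vowel signs (matras) and the signs candrabindu, anusvara, visarga
    return "\u093E" <= ch <= "\u094C" or "\u0901" <= ch <= "\u0903"

def split_aksharas(word):
    """Split a Devanagari word into written aksharas (simplified rule)."""
    aksharas = []
    current = ""
    for ch in word:
        if is_dependent(ch) or ch == VIRAMA:
            current += ch                      # attaches to the running akshara
        elif is_consonant(ch) and current.endswith(VIRAMA):
            current += ch                      # continues a conjunct
        else:
            if current:
                aksharas.append(current)
            current = ch                       # starts a new akshara
    if current:
        aksharas.append(current)
    return aksharas

dharmakshetre = "\u0927\u0930\u094D\u092E\u0915\u094D\u0937\u0947\u0924\u094D\u0930\u0947"
kurukshetre = "\u0915\u0941\u0930\u0941\u0915\u094D\u0937\u0947\u0924\u094D\u0930\u0947"

first_group = split_aksharas(dharmakshetre) + split_aksharas(kurukshetre)
print(len(first_group), first_group)
```

Twenty four stored characters reduce to eight aksharas, which is the unit
in which the Metre is reckoned.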
In Sanskrit
and other Indian languages, one observes strict adherence to the rules
for the Metre. If one has to work with text from any linguistic
angle, one sees the need to identify and work with aksharas which
include combinations of consonants and a vowel. It is known that
there are several thousand combinations, each having an individual
representation, even though all of them are derived from a basic set of
about thirty five consonants and sixteen vowels. On the basis of
this observation, the question that comes to our mind is "what will constitute
the character set for Indian Languages?". We may approach the question
from two or three different viewpoints or approaches to character coding.
Internal Representation: Approach-1
Treat
the basic set of consonants and vowels as the character set and recommend
a code for each consonant and vowel.
It will be easy to
accommodate this set within the range of displayable ASCII. However, this
alone will not work in practice, for in Indian languages, a consonant vowel
combination is a single Akshara and cannot be represented by writing the
vowel after the consonant. It is possible to assign additional codes
for the ligatures (vowel extensions which are called Matras), so that
consonant vowel combinations are also handled through the codes. This
approach has the advantage that conventional word processors or editors
may be used to prepare text if appropriate fonts are available.
This is basically
the approach taken in the ISCII scheme,
a standard that was proposed in the eighties and was revised during
the early nineties. The ISCII values were assigned in the range
160-255 and so one could work with Roman and Indian Scripts simultaneously.
However, data entry is not straightforward from any standard word processor
and special software is required. The Centre for Development of Advanced
Computing (C-DAC) pioneered the development of systems for multilingual
text but they approached the problem partly through hardware solutions
for the PC platforms. In recent times they have brought out Windows
based software (ILeap) but the earlier problems continue to remain in respect
of error free transliteration across languages as well as lack of preservation
of the sorting order in Southern languages.
The primary objection
to the ISCII approach, which is an eight bit representation of the
consonants and vowels, is that text processing becomes cumbersome on
account of the variable number of bytes for each Akshara. Before
an akshara can be displayed, one has to identify the terminating vowel
associated with the consonant (or conjunct) and generate the shape
to be displayed. This is really the most difficult aspect of the
approach, as it requires a complex algorithm to associate a variable
number of bytes with a shape that is either obtained through a single glyph
or built from multiple glyphs, depending on the set of shapes (known
as glyphs) supported in the font. In the case of the Roman
letters, such a complex situation does not arise, for each byte is associated
with one glyph only.
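The variable length of an akshara under a consonant and vowel coding can be
made concrete with a short Python sketch. It uses Unicode Devanagari
characters as a stand-in for ISCII bytes, on the assumption that each
consonant, halant and matra occupies one code in either scheme. The Roman
labels are informal names chosen for this example.

```python
# Four aksharas, each spelled out as its sequence of stored codes.
aksharas = {
    "ka":   "\u0915",                               # consonant alone
    "ki":   "\u0915\u093F",                         # consonant + matra
    "ksha": "\u0915\u094D\u0937",                   # consonant + halant + consonant
    "stri": "\u0938\u094D\u0924\u094D\u0930\u0940", # three consonants + matra
}

# Number of stored codes per akshara: it is not constant.
lengths = {name: len(codes) for name, codes in aksharas.items()}
for name, count in lengths.items():
    print(f"{name:5s} {count} codes")
```

A program that has to step through text akshara by akshara therefore cannot
advance by a fixed number of codes; it has to parse each one.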
From the illustration
above, it is clear that any system based on ISCII (or other eight bit representation)
has to keep track of a variable number of bytes for each akshara and what
is more, combine appropriate glyphs from the font to display the same.
In the case above, the first syllable is obtained by combining two
glyphs by overlaying the second on the first, while the second has
to be built from three glyphs by placing them side by side. Thus the
process of displaying the akshara from the internal representation is quite
complex. Worse still, it is language or script dependent since the
writing systems vary across the languages.
In spite of
this complexity, a system based on ISCII is implementable, though with
some difficulty. For a discussion on the problems
faced with the ISCII scheme of coding visit the corresponding pages.
It turns out that
UNICODE for Indian languages is also similar in concept to that of ISCII
(basically Unicode for Indian languages was derived from ISCII) with
minor changes effected. Consequently our discussion of ISCII also applies
to UNICODE.
Internal Representation: Approach-2
In
this approach, we use Roman letters or a short Roman string to represent
each vowel and consonant of the language. Each string will, in some
discernible manner, indicate the akshara it stands for. This will be a
variable length string representation but will consist of only the Roman
letters. Given below are some examples of the representation for
some vowels, consonants and some conjuncts as well.
We notice that this
representation helps more for data entry than display. In this specific
case, transliteration involves only the lower case letters and so typing
in the data should be relatively easy. The problem of figuring out the
glyphs from the strings continues to be relevant. However, for some fonts,
the Roman letters themselves correspond to the glyphs required. Of
course this works mostly for the basic consonants and vowels. Samyuktaksharas
are still difficult to enter. In any transliteration scheme such as the
one above, there may be instances where one could see ambiguities, e.g.,
typing in the vowel "i" following an "a" will be construed as a single
vowel though in some cases two vowels may be desired (Gujarati has some
words where "i" will follow an "a"). Also a transliteration scheme will
have to use more than the 26 letters of the Roman alphabet to accommodate
the full set of vowels and consonants of the Indian languages and this
reduces the number of punctuation marks and special symbols which may be
used in the text.
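The ambiguity described above can be reproduced with a small longest-match
tokenizer in Python. The symbol list is a hypothetical fragment of a
transliteration scheme, made up for this sketch and not taken from any
particular package.

```python
# Hypothetical fragment of a Roman transliteration scheme.
SYMBOLS = ["ai", "au", "aa", "ii", "uu", "kh", "gh",
           "a", "i", "u", "k", "g", "m"]
BY_LENGTH = sorted(SYMBOLS, key=len, reverse=True)

def tokenize(roman):
    """Split a Roman string into scheme symbols, longest match first."""
    tokens = []
    pos = 0
    while pos < len(roman):
        for sym in BY_LENGTH:
            if roman.startswith(sym, pos):
                tokens.append(sym)
                pos += len(sym)
                break
        else:
            raise ValueError(f"no symbol matches at position {pos}")
    return tokens

# "a" followed by "i" is always read as the single vowel "ai".
print(tokenize("kai"))
print(tokenize("khaa"))
```

With this rule the two separate vowels "a" and "i" can never be entered one
after the other; a scheme has to reserve some extra separator symbol for
that case, which is one more character taken away from ordinary use.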
This approach is the
Transliteration based representation for Indian language aksharas. Several
software packages have relied on this representation for preparing documents.
Starting with "DVNG", a package based on TeX, to produce quality
printed documents, there have been several transliteration schemes
proposed and used on the web. Most of these schemes have chosen the
strings arbitrarily and hence there is no common choice across the
languages. The section on Transliteration principles
explains the idea behind such schemes.
It should be mentioned
that this approach continues to suffer from the problems of variable
length representations. Also, dictionary sorting order cannot be maintained,
for sorting proceeds on the basis of the arrangement of the ASCII values
and not the order in which Indian language aksharas are placed.
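The sorting problem can be demonstrated in Python. The sketch below assumes
a hypothetical fragment of a transliteration scheme listed in the
traditional order of the script (vowels first, then ka, kha, ga, gha), and
compares plain ASCII sorting with sorting on the position of each symbol in
that order.

```python
# Hypothetical scheme symbols, listed in the script's own order.
ORDER = ["a", "aa", "i", "ii", "u", "uu", "k", "kh", "g", "gh"]
RANK = {sym: position for position, sym in enumerate(ORDER)}
BY_LENGTH = sorted(ORDER, key=len, reverse=True)

def tokenize(word):
    """Split a transliterated word into scheme symbols, longest match first."""
    tokens, pos = [], 0
    while pos < len(word):
        for sym in BY_LENGTH:
            if word.startswith(sym, pos):
                tokens.append(sym)
                pos += len(sym)
                break
        else:
            raise ValueError(f"no symbol matches at position {pos}")
    return tokens

def script_key(word):
    """Sort key that follows the script order instead of ASCII values."""
    return [RANK[sym] for sym in tokenize(word)]

words = ["ga", "kha", "ka", "ki", "kaa"]
ascii_sorted = sorted(words)
script_sorted = sorted(words, key=script_key)
print(ascii_sorted)
print(script_sorted)
```

Plain sorting places "ga" before "ka" and "kha" before "ki", neither of
which matches the order of the script. The correct order is obtained only
by parsing every string first.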
Internal representation: Approach-3
In
this approach to defining the character set and assigning codes, we identify
a set of all aksharas that have been in use across all the languages of
the country and assign unique codes to each akshara including Samyuktaksharas.
As seen in earlier sections, this set is of the order of thousands
of individual combinations and so the normal eight bit encoding will not
be adequate. A sixteen bit encoding is recommended.
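A rough count shows why eight bits fall short. The Python sketch below uses
the figures quoted earlier (about thirty five consonants and sixteen vowels)
and counts every conceivable combination with up to two consonants. These
are upper bounds assumed for illustration only, since not every combination
occurs in practice.

```python
CONSONANTS = 35
VOWELS = 16

# Vowels alone, plus every consonant + vowel pair.
simple = VOWELS + CONSONANTS * VOWELS
# Every two-consonant conjunct followed by a vowel.
two_consonant = CONSONANTS ** 2 * VOWELS
total = simple + two_consonant

eight_bit_codes = 2 ** 8
sixteen_bit_codes = 2 ** 16

print("simple aksharas:       ", simple)
print("two-consonant aksharas:", two_consonant)
print("fits in eight bits:    ", total <= eight_bit_codes)
print("fits in sixteen bits:  ", total <= sixteen_bit_codes)
```

Even the simple aksharas alone exceed the 256 codes of an eight bit scheme,
while sixteen bits leave room to spare.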
The sixteen bit code
will work well, for it identifies the aksharas from a linguistic angle
as well. However the most unacceptable aspect of this, at least for
computer people, lies in the fact that no existing software will recognize
this and so the advantages of using conventional software such as
word processors will be lost. It is true that Unicode is commonly handled as a
sixteen bit code and many computer programs may (or will) recognize the
same. However, Unicode, as seen earlier, contains just about seven bit
information relating to the characters for the Indian scripts and so the
methods of handling Unicode will not apply to our 16 bit scheme. In other
words, the number of characters assigned to a script in Unicode is of the
order of 128 for most scripts and hence the associated fonts are expected
to support only that many glyphs.
Though this
proposed 16 bit code catering to a large number of aksharas will be the
right choice for processing text in Indian languages, the need to
write new programs for every meaningful application cannot be ignored.
It is
this problem that the development team at IIT Madras pursued in 1991. Fortunately,
a good solution has been provided. It is a solution
that combines the advantages of both eight bit and sixteen bit representations.
The solution is explained below.
During
data entry, a special processing module converts the entered data into
the sixteen bit representation. The entered text may be displayed using
conventional display methods which use fonts or the text may be displayed
using the special rendering module developed for this purpose.
Since display is based on 8 bit glyph codes, virtually all the methods
available to us for displaying text may be used. Also the 16 bit representation
may be converted into formats that are consistent with other word processors.
One such format has already become popular and is known as the rich text
format. Besides, the HTML format itself is universal enough to display
information through 8 bit font glyph codes and so the approach naturally
allows Indian language web pages to be created with ease.
Thus,
the complex problem of data entry and internal representation for the large
set of aksharas could be managed through special input routines that can
be added to many conventional software packages through the use of input
modules. All text processing may be attempted with fixed size 16 bit codes
and all results displayed by transforming the 16 bit representations
into glyph codes appropriate to the font used. This process is easily accomplished
through a table lookup.
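The table lookup can be sketched in Python as follows. All the numbers are
invented for illustration: the 16 bit akshara codes, the 8 bit glyph codes
and the table itself are hypothetical and do not reproduce the actual
assignments or any real font.

```python
# Hypothetical table for one font: 16 bit akshara code -> 8 bit glyph codes.
GLYPH_TABLE = {
    0x0101: [0x6B],              # an akshara drawn with a single glyph
    0x0102: [0x6B, 0xC9],        # base glyph plus a matra glyph
    0x0203: [0xA4, 0x6B, 0xD2],  # a conjunct built from three glyphs
}

def to_glyph_bytes(akshara_codes):
    """Convert fixed size akshara codes into the font's glyph byte string."""
    out = bytearray()
    for code in akshara_codes:
        out.extend(GLYPH_TABLE[code])
    return bytes(out)

stored_text = [0x0101, 0x0203]
glyphs = to_glyph_bytes(stored_text)
print(glyphs.hex(" "))
```

Supporting another font, under these assumptions, means supplying another
table; the stored 16 bit text is left unchanged.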
In essence,
the approach does not disturb the existing methods which deal with Indian
scripts but merely enhances their function by allowing uniform data entry
across all the languages. The linguistic requirements are also met.
The required conversion routines may be very easily written using
mapping tables and so dealing with ISCII, UNICODE or even transliterated
text becomes quite simple. The figure below illustrates the approach.
During data
entry, the sequence shown above is input. These are transformed into
two aksharas and stored. This unique representation may be converted
to other standard formats such as HTML, RTF, ISCII etc., using conversion
modules running as applications. The module shown as IIT's module
is a special library available on different computer systems to display
the characters without fonts.
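The conversion performed during data entry can be sketched in Python. The
keystroke sequence, the vowel set and the 16 bit code values below are all
hypothetical, since the figure and the actual code assignments are not
reproduced here. The sketch assumes that every akshara is closed by a vowel
and that the finished akshara is looked up in a table.

```python
from array import array

# Hypothetical scheme: entry symbols that count as vowels.
VOWELS = {"a", "aa", "i", "ii", "u", "e"}

# Hypothetical 16 bit codes for a few aksharas.
AKSHARA_CODES = {
    ("k", "a"): 0x0101,
    ("k", "sh", "e"): 0x0234,
    ("t", "r", "e"): 0x0356,
}

def encode_entry(keys):
    """Group entered consonants and vowels into aksharas and store their codes."""
    codes = array("H")            # array of unsigned 16 bit values
    pending = []
    for key in keys:
        pending.append(key)
        if key in VOWELS:         # a vowel closes the current akshara
            codes.append(AKSHARA_CODES[tuple(pending)])
            pending = []
    if pending:
        raise ValueError("entry ends without a closing vowel")
    return codes

entered = ["k", "sh", "e", "t", "r", "e"]
stored = encode_entry(entered)
print(len(entered), "symbols entered,", len(stored), "aksharas stored")
```

Six entered symbols become two stored aksharas, each occupying one fixed
size code, regardless of how many consonants it contains.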
It is thus apparent
that the 16 bit encoding of characters is meaningful in practice as the
interfaces required to work with 8 bit systems are all provided as part
of a package that works with the 16 bit code. Besides text processing,
the 16 bit codes may also serve useful purposes in generating sounds
corresponding to the akshara since each code directly relates to one sound,
the sound of a syllable. This has potential applications in text
to speech systems in Indian languages.
In summary,
the 16 bit encoding for the aksharas will greatly alleviate the problem
of data entry and at the same time provide compatibility with existing
eight bit standards for displaying the scripts. Many different data entry
schemes may be supported based on the convenience to the user, since the
coding has no connection with the keyboard mapping assigned for the
vowels and the consonants. Text processing is also made simple since
a uniform 16 bit code is used for all the aksharas, permitting existing
applications to do the processing by slightly modifying the definition
and handling of the character. Even standard applications may be
used to handle the processing by allowing the applications to work with
ASCII strings obtained through suitable transformations on the 16 bit representation.
Sorting, searching, indexing and similar applications work well and virtually
any client server application may be modified to handle the longer characters
without much difficulty. The method allows perfect transliteration
across the languages and is thus well suited for preparing common information
for the different regions of the country.
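Why searching benefits from one code per akshara can be illustrated in
Python. The sketch below first searches a consonant and vowel level string
(Unicode Devanagari is used as the stand-in) and then a list of akshara
codes. The 16 bit values are hypothetical.

```python
KA = "\u0915"                    # the akshara "ka"
KSHA = "\u0915\u094D\u0937"      # the conjunct "ksha", which starts with ka
MA = "\u092E"

# Consonant and vowel level storage: "ka" is found inside "ksha",
# although the akshara "ka" does not occur in the word at all.
word = KSHA + MA
naive_hit = KA in word

# Akshara level storage with hypothetical 16 bit codes.
CODE = {KA: 0x0101, KSHA: 0x0234, MA: 0x0120}
word_codes = [CODE[KSHA], CODE[MA]]
akshara_hit = CODE[KA] in word_codes

print("consonant level search finds ka:", naive_hit)
print("akshara level search finds ka:  ", akshara_hit)
```

The first search reports a false match, which would have to be filtered out
by parsing the surrounding codes. The second compares whole aksharas and
needs no such step.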
To come back to Character
set for Indian languages, the answer to the question raised above is now
clear. There can be no character set for Indian languages
consistent with the concept of character sets for western languages.
The set of characters used for displaying text in an Indian script could
well form a character set, or for that matter the basic set of vowels and
consonants along with the symbols for the matras. Thus, a character set
may be defined purely for display purposes but from a linguistic angle
we draw a blank unless we consider the set of aksharas as a whole.
This set cannot be precisely defined since new aksharas can always be formed.
However, if we consider the manuscripts over a period of a few centuries,
it may be possible to accept a set of about 12000 aksharas as adequate
for linguistic work. If these are coded properly, text processing will
be relatively easy and existing tools for text processing may be used with
very few modifications to handle sixteen bit codes.
Main Issues
- Character sets: Do we have any for Indian languages?
- Codes assigned only for the basic vowels and consonants
- Using Roman letters to represent text in Indian languages
- Codes for syllables
Important Points to remember
- Text representation using codes which refer to the glyphs of a font is
  adequate for rendering text in Indian languages. Eight bit fonts have been
  used with much success for publishing and printing high quality documents
  in all the Indian scripts.
- Linguistic processing is different in that one has to interpret the text
  string and effect some processing. Searching for specific strings in an
  archive is an example.
- To make linguistic processing effective, one has to work at the level of
  a syllable. Getting a meaningful standard at the level of a syllable is
  difficult, for the set of syllables can become arbitrarily large. Yet, it
  is unlikely that one would require more than about 7000-10000 syllables,
  namely those which have been written in the past.
- It is preferable to have codes of the same size for all the syllables to
  make linguistic processing more effective. String matching (regular
  expression matching) works much better with fixed width codes.
- Representation through transliteration in Roman has been used with some
  success but this scheme uses diacritical marks to distinguish between many
  of the aksharas. Plain Roman can be used, as has been shown in ITRANS, but
  linguistic processing will continue to be difficult.