Software Design Issues
Multilingual Computing with Indian languages
Basic issues
The term "Multilingual Computing" refers here to the use of computer
applications in Indian languages.
Traditionally, computer applications were based on English as the medium
of interaction with the system. In India, when one attempts to use computers
for education and literacy, one faces the problem of language: the majority
of the population that should get the benefit of Information Technology
does not speak English.
Elsewhere in
the world, computer applications have been developed in different languages,
appropriate to the user communities in different countries. It is seen
that application development relies on user interfaces which display information
in the script that is relevant to a user. The scripts used for text displays
have generally been simple scripts based on the letters of the alphabet,
typically running into just about a hundred different symbols or shapes.
Coding such information has been relatively straightforward.
The writing
systems in use in the South Asian Region of the world are based on syllable
representation and for this reason, it has generally been difficult to
develop user interfaces supporting them. Multilingual text representation
has however been made possible through Unicode, the scheme that supports
representation for the scripts of the world so that computer applications
could really be multilingual in terms of user interaction.
It turns out
that it has not been easy to adapt the methodologies suggested by Unicode
to text in Indian languages on account of the complexities of the writing
systems. While in principle, methods have been suggested for handling syllable
level information through multibyte Unicode strings, practical difficulties
arise in developing applications based on Unicode (or for that matter ISCII)
where text or linguistic processing is involved. Basically the problem
has to do with the issue of internal text representation which has to be
necessarily different from the representation for rendering the text. Technical
solutions have been recommended to handle this but one cannot assert that
the results have been satisfactory.
At the Systems
Development Laboratory, IIT Madras, the view is that for computer applications
to be really meaningful, text processing with Indian languages must be
attempted at a syllable level, consistent with the rules of the writing
system. The project undertaken in the lab has emphasized the need to develop
solutions that are universal in terms of applicability across the languages
of India. There are several technical issues to be considered as also the
viability of solutions for implementations that will gain acceptance by
the users. The discussions in the pages of this site highlight the
issues involved in computing with Indian languages.
Rendering Text
An interactive computer
application should permit data entry and display in the script of interest
to the user. With writing systems that are syllabic in nature, data entry
has to ultimately lead to the formation of syllables, if needed, through
multiple key strokes. The problem of text rendering deals with approaches
that can be taken to represent syllables in such a way they can be displayed
correctly.
In most computer applications today, the internal representation involves codes for each
key stroke and hence the syllable is specified through a variable number of codes. From
this variable length code (the internal representation) one has to generate the
display using appropriate fonts. Dealing with variable length codes is difficult,
since a program cannot assume that one code corresponds to one written unit. It is
for this reason that IITM has recommended the use of fixed length syllable level
codes for text. The problem of text rendering
is discussed in detail in the section on Electronic Representation of Text.
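As a rough illustration of the variable length problem described above, the following Python sketch counts the Unicode code points needed to store three Devanagari syllables. The choice of Devanagari, the three sample syllables and the helper name code_lengths are assumptions made for this example only.

```python
import unicodedata

# Three Devanagari syllables of increasing complexity, written as explicit
# Unicode escapes so that the number of stored codes is unambiguous.
SYLLABLES = {
    "ka": "\u0915",                                    # KA
    "ki": "\u0915\u093F",                              # KA + VOWEL SIGN I
    "stri": "\u0938\u094D\u0924\u094D\u0930\u0940",    # SA VIRAMA TA VIRAMA RA VOWEL SIGN II
}


def code_lengths(table):
    """Return the number of code points used to store each syllable."""
    return {name: len(text) for name, text in table.items()}


def describe(text):
    """Return the Unicode character names that make up one syllable."""
    return [unicodedata.name(ch) for ch in text]
```

Each entry is a single written unit for the reader, yet it occupies one, two or six codes in storage, so a program cannot step through such text syllable by syllable with a fixed stride.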
Encoding Standards
Over the years, different
approaches have been taken to represent syllables for internal storage.
Many of these relied on the availability of a specific font designed to
accommodate the shapes from which syllables could be composed (built).
In 1991, the ISCII standard was adopted for general use and subsequently
Unicode for different Indian scripts became a standard. Salient aspects
of the ISCII and Unicode encoding standards are discussed in a separate section.
Font Standards
In simple terms, a
font provides for displaying a set of symbols through well defined shapes
for each symbol. The symbol is a generic concept and the font is an instance
of specific representation of a set of symbols. Traditionally, the
symbols mentioned here have been the letters of the alphabet in a particular
language along with punctuation marks and special characters. Fonts used
to be created by craftsmen and artists during the days of printing machines
that used movable type faces. Today, fonts are created by artists and designers
who work with computer based tools.
Inside a font,
the specific shape for a symbol is described either in terms of a digital
image through bit maps or in terms of a filled outline. The former is called
a bit mapped font and the latter, an outline font. Outline fonts are increasingly
being used on account of their scalability. The descriptions result in
a pictorial representation or shape for each symbol, which is referred
to as a glyph. Fonts addressed through eight bit codes have a provision for
describing up to 256 different
glyphs, though in practice only about 190 to 240 may be present. Text mode
displays on computers (DOS or Unix command shells) use bit
mapped fonts while outline fonts are used with graphical user interfaces.
Font standards have
evolved over the years and apply to the scripts of different languages.
Fonts with a restriction of eight bit codes for selecting the glyphs cannot
support multiple scripts at the same time. Traditionally, to allow a given
8 bit code to refer to a shape, the concept of Font Encoding was employed.
The concept behind font encoding permits a shape to be identified with
a name (may be even different names). Thus the name for a shape is mapped
to a geometrical description of the shape. Such mapping implies that the
code for a given name is located through a table mapping names to codes
and the resulting code used as an index into the set of glyphs to select
the shape appropriate to the name.
Fonts are typically
designed to support specific encoding schemes. While rendering text generated
from a computer application, the code for the text is first mapped to a
name and the shape corresponding to the named character is located through
the internal name to glyph mapping in the font. Web browsers allow the
choice of specific fonts for specific text encoding schemes, which typically
differ across platforms. This permits text in a web page to be displayed
properly using an appropriate font for which support is provided in the
system on which the browser is running.
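The two step lookup described in this section (code to name, then name to shape) can be sketched in Python as follows. The encoding table, the glyph names and the placeholder outline strings are all invented for illustration; a real font stores geometric outlines or bitmaps rather than strings.

```python
# Hypothetical encoding vector: 8 bit code -> glyph name.
ENCODING = {0x41: "A", 0x42: "B", 0xE9: "eacute"}

# Hypothetical font: glyph name -> shape description (placeholder strings).
GLYPHS = {
    "A": "<outline A>",
    "B": "<outline B>",
    "eacute": "<outline e-acute>",
    ".notdef": "<empty box>",  # shape shown when no glyph is available
}


def glyph_for_code(code, encoding=ENCODING, glyphs=GLYPHS):
    """Map a text code to a glyph name, then the name to a shape."""
    name = encoding.get(code, ".notdef")
    return glyphs.get(name, glyphs[".notdef"])
```

Swapping in a different encoding table changes which shape a given 8 bit code selects without touching the glyph descriptions, which is why the same stored byte can display differently under two encodings.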
Fonts for Indian languages
Fonts are inherently
proprietary in nature and tend to be incompatible across computer systems
as well as applications. It is true that the internet is a repository where
many fonts are freely given by their designers. However, the design of
a font, commercial or freely given, deals with issues of rendering of complex
conjuncts and syllables when one thinks of Indian languages/scripts. Today,
it appears that very few non Roman fonts are available for practical use
which are supported under all the important platforms. This has imposed
fairly severe restrictions in respect of web displays of Indian Language
text on account of the totally arbitrary approach to designing the fonts.
True, these fonts were not designed with the idea of text processing but
more for getting good printouts. However when we use the same for web pages,
we run into many incompatibilities.
In text editors and
word processors, the internal representation of text is usually the ASCII
code (possibly Unicode today) of the letter displayed and these codes happen
to be the same as the numeric codes assigned to the glyph locations containing
the letters of the alphabet in standard fonts for the western languages.
When it comes to fonts for Indian languages, the display has to be built
up with more than one glyph for many aksharas and hence the internal representation
of the aksharas is purely a function of where the glyphs for the aksharas
are located within the font. Thus, the stored
text is not in a format that can be viewed on different computer systems,
because the encoding standard may not be supported in each system. Also,
glyph codes are the choice of the font designer and will bear no relationship
to the ordering of the aksharas in our scripts. Thus, linguistic processing
of the stored text is a formidable task, being font dependent even for
the same script.
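The font dependence described above can be made concrete with a small Python sketch. The two fonts and their glyph positions are entirely hypothetical; the point is only that the same aksharas produce different stored bytes under each font, and that sorting by glyph code follows neither font's order nor the order of the script.

```python
# Two hypothetical 8 bit fonts; the glyph positions are invented.
FONT_A = {"ka": 0x6B, "kha": 0x4B, "ga": 0x67}
FONT_B = {"ka": 0xD1, "kha": 0xD5, "ga": 0xA3}

# The first three consonants in the traditional order of the script.
SCRIPT_ORDER = ["ka", "kha", "ga"]


def store(aksharas, font):
    """Store text as the glyph codes of one particular font."""
    return bytes(font[name] for name in aksharas)


def order_by_glyph_code(font, aksharas):
    """Sort aksharas the way a naive byte-wise sort of stored text would."""
    return sorted(aksharas, key=font.__getitem__)
```

Under these assumed tables, store(SCRIPT_ORDER, FONT_A) and store(SCRIPT_ORDER, FONT_B) differ byte for byte, and neither font sorts the consonants in script order.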
Thousands of fonts
have been designed for Indian scripts but each design has its own specific
purpose, often compatibility with a typing scheme. Applications dealing
with Indian scripts have generally relied on the availability of a specific
font(s) for a script. Consequently applications could not move data transparently
across platforms since encoding issues come into play. The problem continues
even with Unicode. Some standardization has been effected for text
representation, but text rendering is meant to be separated from internal
representation (strictly not true for Unicode), so the displayed result still
depends on the font in use.
Unlike a web
page in English which could be seen through a substitute font, if the specified
font were not available, text in Indian languages will require the specified
font to be present. This is the reason why web pages often make the font
associated with the displayed information available for download. Unfortunately,
a single font cannot be used across platforms and even a given font is
not guaranteed to display text properly if the internal encoding differs,
which is usually the case with many Indian language fonts.
Fonts
for Indian languages are discussed in a separate section.
Language enabling
Language enabling
is a concept where a computer application will be able to allow data entry
and display in the required language by allowing dynamic selection of the
language during data entry. In an application that is enabled for a particular
language, what is seen on the screen or printed, will have text in that
language. Data entry may not always be straightforward if the letters of
the language bear no resemblance to the Roman alphabet. Yet many applications
may project a keyboard on the screen and allow data entry through mouse
clicks. In all these cases, the current practice is largely, one keystroke,
one glyph, where each glyph shown on the screen corresponds to an individual
letter of the alphabet. This approach does pose difficulties for languages
where the representation of the characters involves combinations of two
or more glyphs to display a single conjunct character.
Language enabling
methods rely on switching keyboard input for the entry of text in different
scripts. The current approach is to effect this switch through services in
the Operating system (typically called Locales). It often happens that
one is required to switch locales to permit text input in respect of punctuation
or special symbols that are often required in a script but these symbols
may not be part of the traditional writing system using the script.
This happens quite frequently with Indian scripts which today, employ standard
punctuation. Often the keyboard assignments are tight and one may not be
able to accommodate the newer symbols unless the locale is switched or
multiple keystrokes are employed even for simple punctuation. In other
words, keyboard switching becomes inevitable when multilingual text is
to be entered and the required locales have to be included in the OS for
this to work properly.
Localization
Language Localization
is a totally different concept in which the entire interaction with the
application, including all the commands, is done in the specific script
for the language. This calls for major enhancements to the system software
to allow interpretation of text strings in different languages. In an application
that is localized for a particular language, one may never see Roman text
on the screen and all computing, including naming of files, may be done
in the specified language. In other words, an application supporting localization
for a language can provide an effective user interface in that language.
Thus a person need not know English to run the application.
Localization
is difficult to achieve for languages which have a large number of letters,
such as Indian languages. This is a consequence of the fact that localization
of applications still relies on the assumption that a small set of letters (128)
is all that will be encountered for text processing! It turns out that
while one sees improvements in rendering multilingual text, the interpretation
of the text string continues to pose problems. The real problem is that
of having to work with syllables for the purpose of interpretation while
the rendering of text has to do with the shapes of the written characters.
Unicode
Unicode for Indian scripts
The generic
concept of Unicode works well for the western languages where there is
only one shape associated with one and only one code value. That is, each
code value can directly refer to a glyph index and when the glyphs are
placed side by side, the required display is achieved. In this case, a
text string is rendered simply by horizontally concatenating the shapes
(Glyphs) of the letters. Thus a Unicode font for a western script need
have only one glyph for each character code. The Glyph index and the code
value can therefore be exactly the same. When the glyph indices are given,
the original text is also known exactly due to the one to one mapping.
Most languages whose writing system is based on the Latin alphabet come
under this category.
This simplistic view
does not help when the displayed shape does not correspond to a single
letter but relates to a group of consonants and a vowel which constitute
a linguistic quantum. In the South Asian region, writing systems are
based on rendering syllables and not the consonants and vowels. The accented
characters mentioned earlier may also be viewed in this light as being
made up of two or more shapes derived from two or more codes.
The problem at hand
in respect of Indian languages is one of finding a way to display
thousands of such combinations of basic letters where each combination
is recognized as a proper syllable. This corresponds to a situation where
a string of character codes map to a single shape. In the context of Indian
scripts, the code for a consonant followed by a code for the vowel will
usually imply a simple syllable often rendered by adding a matra (vowel sign)
to the consonant, though there are enough exceptions to this rule.
Those responsible
for assigning Unicode values to Indian languages had known about the complexity
of rendering syllables. But they felt that the assigned codes correctly
reflected the linguistic information in the syllable and so suggested that
there was no need to assign codes to each syllable. It would be (and should
be) possible to identify the same from a string of consonant and vowel
codes (Just as syllables are identified in English). What was specifically
recommended was that an appropriate rendering engine or shaping engine
should be used to actually generate the display from the multibyte representation
of a syllable.
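Identifying syllables from a string of consonant and vowel codes, as recommended above, can be sketched with a regular expression. The Python example below is a deliberately simplified segmenter for Devanagari only; the pattern and the function name split_syllables are assumptions for illustration, and the pattern ignores many cases (rare letters, joiner characters, Vedic signs) that a real shaping engine must handle.

```python
import re

CONS = "[\u0915-\u0939]\u093C?"   # consonant KA..HA, optionally with nukta
VIRAMA = "\u094D"                 # suppresses the inherent vowel
MATRA = "[\u093E-\u094C]"         # dependent vowel signs
MODIFIER = "[\u0901-\u0903]"      # candrabindu, anusvara, visarga
VOWEL = "[\u0904-\u0914]"         # independent vowels

# A syllable is either a consonant cluster with an optional final virama or
# vowel sign and an optional modifier, or an independent vowel with an
# optional modifier.
SYLLABLE = re.compile(
    f"(?:{CONS}{VIRAMA})*{CONS}(?:{VIRAMA}|{MATRA})?{MODIFIER}?"
    f"|{VOWEL}{MODIFIER}?"
)


def split_syllables(text):
    """Group a Devanagari code point string into syllable-sized pieces."""
    return SYLLABLE.findall(text)
```

With this pattern the six code points of "hindi" (HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II) are grouped into two syllables, and the six code points of "stri" into one. The grouping says nothing about how each syllable should be drawn; that remains the job of the font and the rendering engine.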
Since the Unicode assignments for Indian scripts
were derived from ISCII, there was also the special provision of Unicode values to specify
the context in which a consonant or vowel was being rendered as part of
a syllable. In other words, Unicode also provided for explicit representations
achieved by forcing the rendering utility to build up a shape for a syllable,
different from what might be a default. Thus Unicode for Indian scripts
does not strictly separate the rendering from the internal representation
and provides codes which specify the context for rendering. This bias,
exhibited by Unicode can cause enough headaches for developers when Unicode
text has to be processed.
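One place where this mixing of rendering hints with stored text becomes visible is in the zero width joiner characters (U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH NON-JOINER), which Unicode uses to request a particular display form of a conjunct. The Python sketch below assumes a Devanagari example and a hypothetical helper strip_joiners; it shows that two spellings of the same syllable compare as different strings until the hints are removed.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER

# KA + VIRAMA + SSA: the renderer is free to show the conjunct form.
KSSA_DEFAULT = "\u0915\u094D\u0937"
# The same letters with ZWNJ after the virama, asking for a visible virama.
KSSA_EXPLICIT = "\u0915\u094D\u200C\u0937"


def strip_joiners(text):
    """Remove rendering hints so that only the letters are compared."""
    return text.replace(ZWNJ, "").replace(ZWJ, "")


def same_letters(a, b):
    """Compare two strings while ignoring joiner characters."""
    return strip_joiners(a) == strip_joiners(b)
```

A search, sort or spell check that compares the raw strings treats the two spellings as unrelated words, so an application has to decide for itself when the joiners matter and when they should be ignored.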
Limitations
Limitations with Unicode
are seen more in respect of text processing than text display. The nature
of the writing systems followed in India requires multiple display forms
for a given syllable and this cannot be provided easily, for the onus is
on the programmer to check if the required display form can be generated
using the given font. Hence the application is influenced by what the font
can offer. A direct consequence of this is that applications across platforms
will not be in a position to utilize a standard rendering approach resulting
in incompatibilities across applications and platforms.
A detailed discussion
of Unicode for Indian Languages
and the incompatibilities observed in standard applications has been included
in an independent section of this site.
Data Entry methods
The specific problem
discussed here is the use of the standard QWERTY keyboard to key in
data in Indian scripts. This is a fairly well understood problem and several
solutions are available. Data entry rules should be easy to follow while
at the same time permit the formation of complex conjunct aksharas consistent
with the rules of the writing system. In respect of Indian scripts, one
sees additional requirements brought about by the use of punctuation marks
from the Roman Script. Please read the discussion in the section on Data
entry methods suited for Indian languages.
Transliteration
Transliteration
has been an important approach to displaying text in Sanskrit, Tamil and
other languages using equivalent Roman letters with suitable diacritical
marks. Use of transliteration simplifies text processing if only the Roman
letters are involved. In fact TeX has taken this further to permit description
of the syllables so that a preprocessor could be used to identify the manner
in which the syllable could be composed.
Transliteration can
be seen in books printed about Indology during the early days of printing
in India when typefaces for Indian scripts had not come into use. Unfortunately,
there have been no standards in respect of the choice of Roman letters.
Today, several schemes are in use each having specific merit for specific
scripts.
Transliterated text
is more amenable to linguistic processing, since conventional text processing
algorithms that operate on ASCII can be applied to it directly.
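As a minimal sketch of such processing, the Python example below splits transliterated text into syllables with an ordinary regular expression. The transliteration convention used here (lowercase ASCII, long vowels written by doubling, as in "aa" and "ii") is an assumption made for the example and is not one of the schemes referred to in this section.

```python
import re

# Zero or more consonant letters followed by one vowel; the two letter
# vowels are listed first so that "aa" is not read as two short vowels.
TRANSLIT_SYLLABLE = re.compile(r"[b-df-hj-np-tv-z]*(?:aa|ai|au|ii|uu|[aeiou])")


def translit_syllables(word):
    """Split a transliterated lowercase ASCII word into syllables."""
    return TRANSLIT_SYLLABLE.findall(word)
```

Because every symbol is a plain ASCII letter, the pattern needs no knowledge of fonts or glyphs, and the same approach extends to searching and sorting once a transliteration scheme has been fixed.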
The Acharya site offers
detailed information on Transliteration
and the IITM software includes utilities for converting transliterated
text to a format suitable for further processing with the IITM software
tools.
Application development
Interactive applications
The IITM software
was primarily written to support effective user interfaces in Indian languages
for different applications. The software is very well suited for developing
web based interfaces for Indian scripts. The multilingual editor is the
base for many other interactive applications to support Indian language
text entry into a computer. Please look at the page describing the applications
developed as part of the IITM project.
Syllable level Codes
Syllable level codes
simplify text processing since existing algorithms for fixed length codes
could be utilized. While it is true that applications based on Unicode
representation of text in Indian languages have been implemented with some
success, the basic problem of uniformity across applications (and platforms)
continues. The real issue here is that applications which process Unicode
are required to handle specific rendering of text. With syllable level
codes one does not see the problem. Though one can legitimately ask if
syllable level codes can be standardized for arbitrary syllables, it turns
out that virtually every syllable encountered in practice can be handled
using a superset of syllables across the languages. Detailed discussion
of the Syllable level coding scheme is provided in the section relating
to the IITM Syllable level Coding scheme.
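The benefit of fixed length codes can be sketched in Python as follows. The class below assigns a two byte code to each syllable the first time it is seen; this numbering is purely hypothetical and is not the IITM coding scheme, which defines its own table. The syllables are passed in as already separated strings (transliterated here for readability).

```python
import struct


class SyllableCodec:
    """Hypothetical codec: one 16 bit big-endian code per syllable."""

    def __init__(self):
        self.code_of = {}       # syllable -> code
        self.syllable_of = []   # code -> syllable

    def encode(self, syllables):
        """Pack a list of syllables into two bytes per syllable."""
        codes = []
        for syllable in syllables:
            if syllable not in self.code_of:
                self.code_of[syllable] = len(self.syllable_of)
                self.syllable_of.append(syllable)
            codes.append(self.code_of[syllable])
        return struct.pack(f">{len(codes)}H", *codes)

    def decode(self, data):
        """Unpack two byte codes back into the list of syllables."""
        count = len(data) // 2
        return [self.syllable_of[c] for c in struct.unpack(f">{count}H", data)]
```

With this layout the n-th syllable always occupies bytes 2n and 2n+1, so the length of a word in syllables, the extraction of a substring and the comparison of two syllables reduce to the same fixed stride operations used for single byte text.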
Local Language Library
The Local Language
Library is a set of functions which may be called from an application program
to perform input, output and string processing of text in Indian languages.
These functions are similar to the ones in the standard C-Library but work
with Syllable level codes. Also, the functions supported have some features
like those supported under the curses library used under Unix for text
rendering. The Local Language Library is universal in the sense that it
supports calls which are language/script independent. Hence, applications
built with the library will work transparently across the languages and
the required script can be selected through parameters when the application
is invoked. The functions work on the basis of fixed length syllable level
codes and hence one can work with standard algorithms for string processing,
regular expression matching, etc.
The Library
is documented in a separate page.
Text to Speech (Software for the disabled)
The availability
of the computer opens up several possibilities for helping the disabled
gain basic educational and professional skills. Many countries of the world
have provided very useful applications suited to specific disabilities
and these have won appreciation from experts all over the world. Unfortunately,
most of these applications require knowledge of English. The IITM software
development team had already envisaged the need for many of these applications
to support Indian language user interfaces. This is a technically challenging
proposition which at the same time can provide many opportunities for the
rural masses of the country to get basic education and through that, employment.
The technical issue ultimately serves a social cause and one needs no further
incentive for taking up the project. As of April 2002, the Systems
Development Laboratory has successfully developed speech enhanced applications
for use by the visually handicapped. Details are available in the corresponding
pages.
PERL Modules
The fixed size two
byte encoding used in the IITM software lends itself to direct manipulations
using PERL which is a remarkably good choice for writing applications which
would interpret scripts written in Indian languages. Very little is required
by way of enhancements to standard PERL which handles regular expressions
with great ease and simplicity.
The enhancement required
in PERL for this is a simple module which can present "llf" characters
(an llf character corresponds to one syllable) as equivalent ASCII strings.
Such a module has been developed in the lab and is known as "llperl". This
module provides support for processing text prepared with the IITM multilingual
editor or any application which can generate syllable level codes consistent
with the IITM coding standard. The idea behind this approach is to permit
PERL programs to be written using the IITM editor where text strings in
Indian languages could be present.
Details of the PERL
Modules are available in the linked page.