Technical issues in providing multilingual support within operating systems.
Basic concepts
Language enabling
Localization
Basic concepts.
Over the past few years, interest in working with multilingual documents
has grown considerably, especially for displaying information on web
pages. Software developers are also looking for ways to make the same
package work with different languages. An example is a text editor that
can be used to prepare text in several languages within the same document.
As of now (2005), the usual approach to generating multilingual documents
relies on Unicode.
During the early years of computing, user interfaces represented text
using the simple ASCII approach. The text displayed in a document would
therefore consist of the letters of just one language. Applications such
as web browsers examine the Content-Type information specified at the
beginning of an HTML page to work out how the information in the page
should be interpreted. This is the concept of the character set: the
numeric codes in the text correspond to specific letters of the character
set of a language.
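The lookup a browser performs can be sketched in a few lines of Python.
This is only an illustration: the function names are hypothetical, and the
fallback to ISO-8859-1 when no charset is declared is an assumption made
for the example.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset named in a Content-Type value, or a default.

    The default is an assumption of this sketch, not a rule from the text.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset parameter in lower case,
    # or the supplied fallback when the parameter is absent.
    return msg.get_content_charset(default)


def decode_page(raw_bytes, header_value):
    """Interpret the bytes of a page using its declared character set."""
    return raw_bytes.decode(charset_from_content_type(header_value))


page = b"caf\xe9"  # the last byte lies in the upper range (128 to 255)
text = decode_page(page, "text/html; charset=ISO-8859-1")
```

Here the final byte is read as a Latin letter only because the declared
character set says so; with a different declaration the same byte would
stand for a different letter.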
The character encoding specifies how a given code should be interpreted.
It must be remembered that only the codes from 32 to 126 (the printable
ASCII characters) have a standard interpretation across different
character sets. The codes in the upper range (128 to 255) are not
guaranteed to be displayed the same way everywhere, even for a given
character set, because each application (typically a web browser) must
know the intricacies of that character set to display the letters
correctly. For example, an HTML document encoded in the standard
ISO-8859-1 character set will not render properly under Netscape
on a Macintosh system!
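A small Python sketch makes the point concrete. It decodes the same two
bytes under three character sets; the choice of byte values and of the
three codecs is purely illustrative.

```python
# One ASCII byte (0x41) followed by one upper-range byte (0xE9).
raw = bytes([0x41, 0xE9])

as_latin1 = raw.decode("iso-8859-1")    # Western European
as_mac = raw.decode("mac_roman")        # classic Macintosh character set
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic

# The first character is 'A' in every case; the second differs each time.
for label, value in [("ISO-8859-1", as_latin1),
                     ("Mac Roman", as_mac),
                     ("ISO-8859-5", as_cyrillic)]:
    print(label, [hex(ord(ch)) for ch in value])
```

The ASCII byte survives every interpretation, while the upper-range byte
becomes a different letter under each character set. This is the kind of
mismatch that arises when a page written for one character set is shown
by software that assumes another.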
A test page to check the rendering capability of a web browser is available at
http://www.utoronto.ca/webdocs/HTMLdocs/NewHTML/entities.html
Language Enabling
For a computer application to work with multiple languages at the same
time, it needs a mechanism for handling different character sets at the
same time. Only recently, as Unicode has gained wide acceptance, have
applications started dealing with character codes that also identify the
character set in use. Unicode has made it possible to represent the
characters of most world languages uniquely by assigning a separate range
of code positions to each language. Text based on Unicode is thus easy to
interpret, and appropriate fonts can be selected to display the letters of
a particular language. The application must also assign keys for data
entry so that text in different languages can be entered without
difficulty.
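Because each script occupies its own range, the code of a character is
enough to tell which script it belongs to. The following Python sketch
shows the idea with a small hand-written table; the table covers only a
few ranges chosen for illustration, and the function name is hypothetical.

```python
# A few Unicode ranges, listed by hand for this example only.
SCRIPT_RANGES = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0900, 0x097F, "Devanagari"),
    (0x0B80, 0x0BFF, "Tamil"),
]


def script_of(ch):
    """Return the name of the range containing the character's code."""
    code = ord(ch)
    for low, high, name in SCRIPT_RANGES:
        if low <= code <= high:
            return name
    return "other"


# Mixed text: Latin 'A', Cyrillic DE, Devanagari KA, Tamil KA.
mixed = "A\u0434\u0915\u0b95"
scripts = [script_of(ch) for ch in mixed]
```

An application could use such a lookup to pick a suitable font for each
run of text.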
Conventional applications that work with eight-bit character codes cannot
handle the double-byte coded text in a Unicode document. Hence it is
necessary to rewrite text editors, or develop new versions, that can
handle Unicode. Microsoft Word is one of the applications capable of
dealing with Unicode.
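The difficulty can be seen in a short Python sketch. It assumes the
document is stored as UTF-16, one common two-byte form of Unicode, and the
function simulating an eight-bit application is hypothetical.

```python
text = "A\u0915"  # Latin 'A' followed by Devanagari KA

# Big-endian UTF-16: each of these two characters takes two bytes.
encoded = text.encode("utf-16-be")


def naive_eight_bit_read(data):
    """Treat every byte as one character, as an eight-bit program would."""
    return [chr(byte) for byte in data]


seen_by_old_program = naive_eight_bit_read(encoded)
seen_by_unicode_program = list(encoded.decode("utf-16-be"))

print(len(seen_by_old_program), len(seen_by_unicode_program))
```

The eight-bit reader finds four separate codes, including a zero byte and
control codes, where the Unicode-aware reader finds the two characters
that were written.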
Language enabling means that the application allows data entry and
display in the required language, with the language selected dynamically
during data entry. In an application that is enabled for a particular
language, what is seen on the screen or printed will have text in that
language. With Unicode, an application selects the appropriate 'Locale'
for data entry in a particular language. The 'Locale' corresponds to a
setting within the operating system to accept and display text in the
preferred language. Different locales have to be installed in an
operating system to cater to different languages.
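Whether a given locale is installed can be checked from a program. The
Python sketch below uses the standard locale module; the function name is
hypothetical, locale names such as "hi_IN.UTF-8" vary between operating
systems, and some systems accept names that others reject.

```python
import locale


def locale_is_installed(name):
    """Report whether the operating system accepts the named locale."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False
    finally:
        # Restore the earlier setting; the change is process-wide.
        locale.setlocale(locale.LC_CTYPE, saved)


# "C" is always present; the Hindi locale name is an example only.
for candidate in ["C", "hi_IN.UTF-8"]:
    print(candidate, locale_is_installed(candidate))
```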
Data entry may not always be straightforward if the letters of the
language bear no resemblance to the Roman alphabet. Even so, many
applications project a keyboard on the screen and allow data entry
through mouse clicks. In all these cases, the current practice is largely
one-keystroke-one-glyph, where each glyph shown on the screen corresponds
to an individual letter of the alphabet. This approach poses difficulties
for languages in which a single conjunct character is displayed by
combining two or more glyphs.
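As an illustration, the Devanagari conjunct "ksha" is read as one
character but is stored in Unicode as three separate codes. The Python
sketch below lists them; the choice of this particular conjunct is ours.

```python
import unicodedata

# One conjunct as seen by the reader: KA + VIRAMA + SSA.
ksha = "\u0915\u094d\u0937"

for ch in ksha:
    print(hex(ord(ch)), unicodedata.name(ch))

code_count = len(ksha)  # number of stored codes, not of displayed shapes
```

A scheme that maps one keystroke or one code to one displayed shape has
no direct way of expressing such a unit.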
Languages whose character sets do not figure in Unicode present a real
problem. However, one may always resort to the trick of telling the
system that a standard character set is being used but that a specified
font should be used for the display. This works well in practice so long
as the document can be prepared consistently with some character set. By
and large, this has been the approach taken to displaying Indian language
characters in most computer applications.
Though Unicode does allocate space for Indian languages, the limited 128
positions given to each script are practically useless when it comes to
dealing with literally thousands of aksharas. So even web pages resort to
specifying the ISO-8859-1 encoding and using almost 186 font glyphs to
form the aksharas. This works well under Windows and Unix, where the
ISO-8859-1 encoding is well honoured, but not on a Mac, where many of the
ISO-8859-1 characters are not recognized properly.
Language enabling thus allows only the preparation of quality documents
using the application. Microsoft Word is known to excel at this for most
world languages and for some Indian languages as well (with a difficult,
non-intuitive data entry method). The data entry method differs from one
Indian language to another, however, as a consequence of differences in
their Unicode assignments.
Localization
Localization is a totally different concept, in which the entire
interaction with the application, including all the commands, takes place
in the specific language. This calls for major enhancements to the system
software to allow interpretation of text strings in different languages.
In an application that is localized for a particular language, one may
never see Roman text on the screen, and all computing, including the
naming of files, may be done in that language. In other words, an
application supporting localization for a language can provide an
effective user interface in that language, so a person need not know
English to run the application.
Localization is difficult to achieve for languages that have a large
number of displayed shapes (Samyuktakshars), such as Indian languages.
This is because localizations still rely on the assumption that a small
set of letters (128) is all that will be encountered in text processing!
Localization has been reasonably successful for many world languages with
small character sets. The reason may be attributed to the short
fixed-length code (just one byte) required to represent the letters of
the alphabet. Even Japanese and Chinese applications, which have to deal
with over twenty thousand characters, have been successful only because
people found ways of assigning unique fixed-length codes to the
characters of these languages.
It is for Indian languages that localization has failed, because no
standards exist for uniquely identifying the different aksharas using a
fixed-length code. The problem is peculiar to those languages of the
world whose writing system corresponds to syllables rather than
individual letters. Where the writing system is not based on syllabic
representation, an eight-bit (or two-byte) fixed-length code is more than
satisfactory.
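The variable length of aksharas can be seen by splitting a Unicode string
into syllable-sized units. The Python sketch below does this for
Devanagari with a regular expression. It is a simplified illustration
under assumed character classes, not a standard algorithm: rare signs are
ignored, and the names are hypothetical.

```python
import re

# Assumed, simplified character classes for Devanagari.
CONSONANT = "[\u0915-\u0939]\u093c?"   # consonant with optional nukta
VIRAMA = "\u094d"                      # joins consonants into a conjunct
VOWEL_SIGN = "[\u093e-\u094c]"         # dependent vowel signs
MODIFIER = "[\u0901-\u0903]"           # candrabindu, anusvara, visarga
INDEPENDENT_VOWEL = "[\u0904-\u0914]"

AKSHARA = re.compile(
    # zero or more "consonant + virama" pairs, then the final consonant
    "(?:" + CONSONANT + VIRAMA + ")*" + CONSONANT
    # followed by an optional vowel sign or virama, and optional modifier
    + "(?:" + VOWEL_SIGN + "|" + VIRAMA + ")?" + MODIFIER + "?"
    # or an independent vowel with an optional modifier
    + "|" + INDEPENDENT_VOWEL + MODIFIER + "?"
    # or any other single character
    + "|.",
    re.DOTALL,
)


def split_aksharas(text):
    """Split text into akshara-like units of varying length."""
    return AKSHARA.findall(text)


# The word "kshatriya": eight stored codes forming three aksharas.
word = "\u0915\u094d\u0937\u0924\u094d\u0930\u093f\u092f"
units = split_aksharas(word)
print(len(word), [len(unit) for unit in units])
```

The three units here take three, four and one code respectively, so no
single fixed length describes an akshara.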
Worldwide, few developers have succeeded in localizing any application
for Indian languages. This is a difficult problem to handle, since one
has to process thousands of aksharas when interpreting a text string.
As of July 2004, only Microsoft Word and some Office 2000/XP products
allow language enabling and localization for some of the Indian
languages. The situation may change, however! Localization using Unicode
has had its share of problems for Indian scripts. It turns out that the
responsibility for rendering text also rests with the application when
variable-length syllables have to be processed. We have a page devoted to
this problem, where a given document in Unicode is rendered in totally
different ways by different web browsers.