Data Entry Methods suited for Indian languages
This section deals with the
subject of preparing texts and documents in various Indian languages and
scripts, by using the standard QWERTY keyboard seen with most computers.
The answer
to the question "can we do it as simply as one does it for English?" is
an obvious NO but a qualified YES. The "no" part of the answer has
to do with the fact that the limited number of keys on the keyboard will
certainly not be able to cater to the thousands of aksharas which occur
in our texts. The qualified "yes" is based on the observation that
the keys may be used to represent only the vowels and consonants and thus
provide for inputting a series of consonants and vowels from which the
required aksharas may be formed using suitable computer programs.
The question
of data entry in Indian scripts has attracted the attention of scholars
and computer experts for many years and today, one sees several different
computer programs which permit document preparation in different languages
and scripts. These programs, some of them very good in many respects, tend
to differ significantly in their approaches to data entry. The variety
seen in their approaches merits discussion so that we may better understand
the problems involved.
The programs
permitting data entry in many Indian languages/scripts may be classified
based on the specific approach taken to forming the aksharas from the keystrokes.
These are listed below.
- Language/script specific data entry which relies on a specific font.
- Transliteration based data entry.
- Data entry conforming to Manual Typewriter keyboards, specific to each language.
- Data entry based on the INSCRIPT layout.
- Data entry methods specific to generating HTML pages supporting our scripts (web page creation).
- Data entry based on uniform mapping of the keys for all the languages/scripts.
Data
entry methods which are based on fonts.
The font based data entry methods utilize the feature supported by conventional
word processors where the font to be used for displaying the text may be
dynamically selected/changed during data entry. Today, the font rendering
capabilities built into the operating systems are quite sophisticated in
that the required shape of a character may be built from several primitive
shapes which are called glyphs. Each font may consist of about 200 different
glyphs, where each glyph may directly represent a letter of the alphabet,
a special character or a symbol.
In fonts for Indian
languages, the glyphs will invariably include shapes for the matras, special
samyuktakshars and special ligatures besides the basic shapes for the vowels
and the consonants themselves. When a font is selected, the word
processor will display the glyphs corresponding to the keys entered. For
the English language (Roman alphabet) each letter corresponds to only one
glyph in the font and data entry is smooth. In the case of Indian scripts,
we will have to know what keys will have to be entered to display the sequence
of glyphs which will make up one character. In the case of Roman, the set
of displayable glyphs corresponds to the set of ASCII codes that are generated
when keys are pressed on the keyboard. This is a set of 96 characters
and anything more than this number will require special data entry, as
the keyboard has only a limited set of keys.
Conventional word
processors are designed for languages where a letter (or a character) to
be displayed is associated with one glyph only. Also, for most of the western
languages, the character set itself is limited and so the set of displayable
characters is well within the 96 mentioned above. Even though a font for
western languages may need to accommodate only the displayable set, many
glyphs in the font may be present that are not displayed when the keys
are pressed during regular data entry. That is, there may be glyphs in
a font which are displayable but not necessarily shown when keys are entered.
These glyphs usually correspond to characters with accent marks, specialized
symbols, diacritic marks etc., and may be required mostly in printed text
and special applications. Some word processors do support data entry
for these glyphs which are typically located in the upper ASCII range (160-255)
by allowing the numeric value of the glyph location to be input with the
ALT key kept down as the numeric value is typed in.
Fonts for Indian languages
(except Tamil) are required to have many more than 96 glyphs and so, data
entry based on this method of inputting the numeric glyph code values and
displaying the character, will become necessarily cumbersome. Worse still,
the input sequences are font specific and will vary from font to font even
for a given script. Fonts for Indian languages have
evolved arbitrarily and do not follow any standards since none exist.
Consequently one sees wide variations in the glyphs themselves as well
as the encoding for the font which locates the glyphs at specified locations
in the table of 256 locations. As of today (March 1999), there are
no standards for glyph locations for Indian scripts and it is likely that
such standards may not be possible at all.
For a basic
discussion
of fonts and the issues to be considered in designing Indian language fonts,
the viewer may look at the relevant section within these pages.
A point
to keep in mind is that the internal representation of the text prepared
according to this method is in the form of eight bit glyph codes. This
has serious consequences if one were to attempt any sort of string processing
of the text because the glyph codes bear no relationship whatsoever to
the linguistic nature of the aksharas in terms of lexical ordering, sorting
or indexing etc. Yet, this font based data entry method is popular with
DTP packages, where one is interested more in printing text as opposed
to linguistic analysis.
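The sorting problem mentioned above can be illustrated with a minimal sketch. The glyph codes below are invented for a hypothetical font (they do not come from any real font encoding), and the four aksharas are named in Roman letters only for readability. The point is simply that ordering text by its eight bit glyph codes need not agree with the lexical order of the aksharas.

```python
# Hypothetical 8-bit glyph codes, assigned arbitrarily as a font designer might.
# These values are assumptions for illustration, not taken from a real font.
GLYPH_CODE = {"ka": 0xB4, "kha": 0x63, "ga": 0xD1, "gha": 0x47}

# The lexical (alphabetical) order of the same four aksharas.
LEXICAL_ORDER = ["ka", "kha", "ga", "gha"]


def sort_by_glyph_code(aksharas):
    """Sort the way a plain byte-wise string sort would treat font-encoded text."""
    return sorted(aksharas, key=lambda name: GLYPH_CODE[name])


def sort_lexically(aksharas):
    """Sort according to the linguistic order of the aksharas."""
    return sorted(aksharas, key=LEXICAL_ORDER.index)


words = ["ga", "ka", "gha", "kha"]
print(sort_by_glyph_code(words))  # order follows the arbitrary glyph codes
print(sort_lexically(words))      # order follows the alphabet
```

Any program that needs sorting or indexing on font encoded text would have to carry a table like LEXICAL_ORDER for every font it supports, which is the difficulty the paragraph above describes.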
There is however a bright side to this approach. Though keyboard entry
is cumbersome, one might effectively use the cut and paste facilities supported
in the word processors to perform some editing of the entered text.
In some word processors, one also sees an image of the keyboard with aksharas
and matras assigned to the keys and the user may simply click on the keys
to select the glyph to be displayed. Also if the user were to keep a standard
file containing the glyphs, then individual glyphs may be cut and pasted
even for entering short sentences of text. Some Urdu word processors have
this feature.
It must be emphasized that data entry on the basis of fonts and glyph codes
cannot really provide a natural interface, even if supported through sophisticated
macro facilities found in some word processors. You may want to
try inputting the following multilingual text using your favourite word
processor or DTP program.
Well, there ought to be an easy way of doing this!
Transliteration
based data entry methods.
Transliteration
has been a popular approach to preparing printed documents in different
Indian scripts. The idea behind the method is to use Roman letters
to represent the aksharas of the languages and process the resulting string
(ASCII text) using special computer programs, to produce printed output.
The output is obtained using appropriate fonts.
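The idea can be sketched in a few lines of Python. This is a minimal illustration and not the scheme used by any of the packages discussed here: the Roman letter assignments cover only a handful of sounds and are assumptions chosen for the example, and the output is written as Unicode Devanagari code points rather than font specific glyph codes. A consonant followed by another consonant is joined with the virama, and a vowel following a consonant is written as a matra.

```python
VIRAMA = "\u094d"  # suppresses the inherent "a" of the preceding consonant

# Illustrative subset only; the Roman spellings are assumptions for this sketch.
CONSONANTS = {
    "k": "\u0915", "kh": "\u0916", "g": "\u0917", "t": "\u0924",
    "n": "\u0928", "m": "\u092e", "r": "\u0930", "s": "\u0938",
}
# Roman spelling -> (independent vowel, matra used after a consonant)
VOWELS = {
    "a": ("\u0905", ""), "aa": ("\u0906", "\u093e"),
    "i": ("\u0907", "\u093f"), "ii": ("\u0908", "\u0940"),
    "u": ("\u0909", "\u0941"), "e": ("\u090f", "\u0947"),
    "o": ("\u0913", "\u094b"),
}


def longest_match(table, text, pos):
    """Return the longest key of `table` found in `text` at `pos`, or None."""
    for length in (2, 1):
        key = text[pos:pos + length]
        if key in table:
            return key
    return None


def transliterate(text):
    """Convert a Roman string into Devanagari using the tables above."""
    out = []
    pending_consonant = False  # True while a consonant awaits its vowel
    pos = 0
    while pos < len(text):
        key = longest_match(CONSONANTS, text, pos)
        if key is not None:
            if pending_consonant:
                out.append(VIRAMA)  # consonant + consonant forms a conjunct
            out.append(CONSONANTS[key])
            pending_consonant = True
            pos += len(key)
            continue
        key = longest_match(VOWELS, text, pos)
        if key is not None:
            independent, matra = VOWELS[key]
            out.append(matra if pending_consonant else independent)
            pending_consonant = False
            pos += len(key)
            continue
        # Anything else (spaces, punctuation) is copied through unchanged.
        if pending_consonant:
            out.append(VIRAMA)
        pending_consonant = False
        out.append(text[pos])
        pos += 1
    if pending_consonant:
        out.append(VIRAMA)
    return "".join(out)


print(transliterate("raama"))
print(transliterate("mantra"))
```

Note that two Roman letters ("ma") are needed here for a consonant with its inherent vowel, and that the whole string is processed after it has been typed, not as each key is pressed.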
One of the early computer programs to successfully implement this idea is the
Dvng processor for Devanagari using TeX. This program produced a TeX file
which could be typeset using the TeX program. Franz Velthuis, who devised
this package, also included a special Devanagari font for use with
the package. The Dvng package ran on Unix systems and TeX fonts have the
advantage that nearly every glyph in the font (which may have as many as
250 glyphs), may be used in printing. In contrast, fonts for other systems
such as X-windows, MSWindows, PostScript etc., are restricted to just about
200 glyphs. This is not a design limitation of the font but a problem
arising out of the inability of application programs and font rendering
routines to look at specific glyph locations. As a consequence of
the rich set of glyphs, the Dvng package could print a rich set of conjunct
characters in Devanagari.
After Dvng, Charles
Wikner enhanced the fonts to accommodate Vedic symbols and also gave
a new processing package. As of today, the Devanagari output obtained
using this package is of remarkably high quality and Wikner's choice (or
design) of the glyphs has allowed nearly a thousand different conjunct
formations to be derived from the basic set of about 250 glyphs.
Both the
packages mentioned here arrived at some guidelines for standardization
in the selection of the Roman letters for the aksharas of Sanskrit.
In many instances, special symbols from the ASCII set were required to
be used to distinguish similar sounding aksharas. Printouts using these
packages were restricted to Devanagari but Roman could be part of the text
as well, permitting bilingual outputs. Subsequently TeX based systems were
introduced for Tamil,
Telugu, Malayalam, Gurmukhi, Gujarati and Bengali.
Following the success of the TeX based packages, Avinash Chopde developed
a special transliteration package that allowed other scripts to be handled
as well, via language specific fonts. His ITRANS
package is well known on the web. Subsequently he enhanced the package
to work with normal fonts under Windows-95 and X-windows and was able to
generate html documents for display on the web. The most recent version
of ITRANS supports quite a few languages.
It must be remembered that all transliteration based data entry methods
require a computer program to generate (as well as format) the output and
hence they cannot be applied or used for interactive data preparation,
where the display in Indian scripts immediately follows the key strokes.
The ITRANS package was followed by JTRANS,
a Javascript based program by Sandip Sibal which allowed quick generation
of html documents from transliterated inputs. This package introduced Xdvng,
a quality font for Devanagari which could be used for viewing web pages
with Devanagari text both under MSWindows and XWindows. Sibal's package
is restricted to Devanagari however.
The Itranslator
package from Onkarnath Ashram in Rishikesh allows data entry in ASCII
using the ITRANS scheme but allows the string to be converted to Devanagari
and displayed on the screen itself. The font used by this package is probably
the finest of the freely available fonts for Devanagari and is known as
Sanskrit_1.2. Unfortunately, this font is suited for the Windows platform
alone and has glyphs in locations that create problems on other platforms.
The more recently announced version of this font (Sanskrit-98) seems to
avoid the above problem. Please see the web site referenced above for recent
additions to the Itranslator package, including new fonts.
Transliteration schemes
for Tamil and Telugu.
There
have been a few popular packages for Tamil and Telugu which use the transliteration
based data entry method. The Adhami package was written for use under DOS
and subsequently enhanced to work under MSWindows and produced displays
and printouts in Tamil. Other transliteration schemes such as Mylai and
Cologne were also popular with Tamil. For Telugu, the RIT package developed
by Rama Rao Kanneganti used TeX for typesetting the output. Details
of some of the transliteration schemes may be found in our pages on transliteration
principles.
Universal transliteration
scheme for Indic scripts.
Recently, Dr.
Anthony P. Stone has recommended a special transliteration scheme
to handle all the Indian scripts. This interesting proposal uses eight
bit character codes to represent the vowels and consonants and hence maps
a fairly large superset of vowels and consonants from all the scripts
of interest. This is a meaningful proposal but has only one likely limitation.
Existing data entry facilities do not permit easy typing of characters
in the upper ASCII range (160-255) and so data entry using this scheme
may not be feasible, as of now. However, it is quite easy to display all
our aksharas using this scheme. Therefore printouts of our texts in transliterated
form, may be easily generated. Standardization of transliteration will
help considerably in dealing with Indian languages in a uniform manner.
Summary of Transliteration
based data entry.
1. This method
allows text in Indian languages to be input using Roman letters. A special
computer program is used to process this text in Roman to produce printouts
or displays using appropriate fonts for the scripts. There are several
transliteration schemes in use. Most of the processing programs run under
Unix.
2. Transliteration
schemes are often specific to one Indian language/script. There is no single
scheme yet that correctly handles all the Indian languages.
3. Phonetically close
Roman letters may not be found for all our aksharas. So some compromise
is required in selecting the Roman letters. Also multiple representations
for the same akshara seem to be allowed, making the processing somewhat
complex.
4. It is possible
to confuse most of the processing programs by inputting arbitrary formations
of conjunct aksharas.
Transliteration
based data entry is a workable solution for Indian scripts, since in principle,
it allows for a uniform data entry mechanism for all the languages. The
transliteration scheme should be comprehensive enough to handle all the
aksharas across all the languages/scripts.
Will it be meaningful
to have a system where, as one types in the transliterated text,
the actual characters of the Indian script appear on the screen? This is
what is being attempted by some of the recent applications which work under
Microsoft Windows systems. While this is an interesting development, the
transliteration schemes used are often language specific and may not always
permit the formation of many complex conjuncts (Samyuktakshars).
Manual
Typewriter Keyboard based data entry.
Manual typewriters for different
Indian languages have been available for quite some time and their use
in Educational institutions and Government offices is substantial. Manual
typewriters provide for a minimal set of aksharas consisting of the basic
vowels and consonants together with the matras so that text can be prepared
conforming to the writing system for the language. The location of the
keys for the vowels and consonants on a regional language typewriter is
specific to the language. Many are adept at using such typewriters
and when they have to move over to using word processors, they would rather
see the same keyboard mappings. Some word processors do indeed provide
for data entry based on the typewriter based key mappings. The resulting
text may not include a number of conjuncts but will be entirely adequate
for normal modern day correspondence.
Data entry based on the INSCRIPT
keyboard.
The INSCRIPT keyboard
allows more or less uniform data entry of text across the different scripts.
The mapping provides for the data entry of vowels, consonants and matras
consistent with the specifications in ISCII. The INSCRIPT layout utilizes
only the keys provided on a standard QWERTY keyboard and is hence implemented
easily on personal computers. It may be observed that a number of keys
normally used for punctuation or special symbols are also mapped to the
ISCII characters. It will therefore be difficult to perform data entry
of text along with a full complement of punctuation marks which have come
into use with almost all the scripts. Microsoft applications also use
the INSCRIPT layout for Unicode data entry and hence suffer from this problem.
The Microsoft Hindi keyboard has apparently provided for many punctuation
marks but one has to effect multiple keystrokes to enter them. Shown below
is the INSCRIPT layout on a QWERTY keyboard. Keys corresponding to the
ISCII characters are common across all the scripts.
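The uniformity across scripts can be sketched as follows. The sketch makes two assumptions that are not stated in the text above. First, it uses Unicode code points instead of ISCII codes, relying on the fact that the Unicode blocks for the Indian scripts place corresponding characters at the same offset from the start of each block. Second, the four key assignments shown follow commonly published INSCRIPT charts but are given here only as an illustration, so they should be checked against an authoritative layout before use.

```python
# Start of the Unicode block for a few scripts.
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
}

# Key -> offset of the character inside each script block.
# Illustrative subset only; treat the key assignments as assumptions.
KEY_OFFSET = {
    "k": 0x15,  # ka
    "c": 0x2E,  # ma
    "j": 0x30,  # ra
    "d": 0x4D,  # virama (halant)
}


def inscript_char(key, script):
    """Return the character produced by `key` when `script` is selected."""
    return chr(SCRIPT_BASE[script] + KEY_OFFSET[key])


for script in SCRIPT_BASE:
    print(script, inscript_char("k", script), inscript_char("c", script))
```

The same key yields the corresponding consonant in every script, so a typist who knows the layout for one script can use it for another. Not every offset is populated in every script (Tamil, for instance, has no aspirated consonants), which a fuller table would need to handle.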
Special
programs for Web page creation.
During the past several
years, display of Indian language text on the Internet (Newspapers and
Magazines) has become popular. Web pages in Indian scripts are feasible
on account of the fact that web browsers may be asked to display a given
text in a specified font. We have included some useful information on this
in our section on setting
up web pages supporting Indian scripts.
The html standard
provides for an interesting way of specifying the glyphs to be displayed
either through the numeric code assigned to the glyph or the universal
name assigned to that glyph location consistent with the font encoding
that has now become standard. This way, the html language also functions
as a macro language, where a text string describing the glyphs to be shown
may be just typed in using standard ASCII. While one may not need
to worry about this for glyphs in the displayable ASCII range, the approach
is very useful for glyphs in the upper ASCII range. In a lighter vein,
some people on the net refer to this as the method for the "ASCII impaired"!
The advantage of this approach need not be emphasized, for virtually any
text editor capable of data entry for the upper ASCII characters can be
used to produce web pages in Indian languages, provided one has patience!
As an example, the html document
shown below will produce the display given in the image that follows. The
entities &lt; and &laquo; (which display as < and «) represent two glyphs
that are specified through their name entities.
<html>
<center> View the source of this document to see how name entities have been
used in preparing the Devanagari string seen below <br>
<font face="sanskrit 1.2"> s&lt;Sk&laquo;tm! </font>
</center>
</html>
The user preparing
the html document must necessarily know the location of the glyphs. This,
as we know is font specific, even if the font is meant for a specific script.
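Typing the entities by hand calls for patience, so the conversion is a natural candidate for a small program. The following is a minimal sketch, assuming the text is already available as a sequence of eight bit glyph codes for the chosen font. It uses the table of entity names in Python's standard html.entities module. The function name is made up for this example.

```python
from html.entities import codepoint2name


def to_entity_string(glyph_codes):
    """Turn 8-bit glyph codes into an ASCII-only string of HTML entities."""
    parts = []
    for code in glyph_codes:
        ch = chr(code)
        if code in codepoint2name and (code > 127 or ch in '<>&"'):
            # A named entity exists, e.g. 171 -> &laquo;
            parts.append("&" + codepoint2name[code] + ";")
        elif code > 127:
            # No name for this location: fall back to the numeric form.
            parts.append("&#" + str(code) + ";")
        else:
            parts.append(ch)
    return "".join(parts)


# Glyph codes for the string used in the html example above.
codes = bytes([0x73, 0x3C, 0x53, 0x6B, 0xAB, 0x74, 0x6D, 0x21])
print(to_entity_string(codes))
```

The result contains only displayable ASCII characters and can be pasted between the font tags of a page. The glyph codes themselves still have to come from knowledge of the particular font.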
In a sense, generating display
through html is similar to the macro based approach taken by TeX, the typesetting
program developed by Dr. Knuth. While TeX has the advantage of using most
of the 256 glyphs in a font, html displays are constrained to using only
about 200, thus losing the ability to display some conjunct letters.
Web Pages supporting display
of Unicode Text.
Unicode has been accepted
as a meaningful standard for handling multilingual text. Most browsers
introduced after 2002 include support for this. With Unicode text, the
method indicated above does not apply, for the encoding standard automatically
identifies the font to be used. Unfortunately, rendering Unicode text in
Indian languages is beset with multitudes of problems and it is unlikely
that correct rendering of text will be realized. Unicode text in Indian
scripts will have to be created using appropriate programs such as Microsoft
Word and related applications. As of this writing (April 2005), several
browsers cannot correctly display Indian language text represented through
Unicode. The interested reader may visit the pages at this site where the
difficulties encountered in dealing with Unicode for Indian languages are
explained in greater detail.
Phonetic
mapping of the vowels and consonants.
One way of looking at data
entry in Indian languages is to view the text as consisting of aksharas
that can always be decomposed into vowels and consonants and perhaps a
few symbols. In this phonetic approach to data entry, just one key stroke
is associated with each vowel and consonant and a computer program (typically
an input module in an application) keeps track of the keystrokes and forms
the aksharas. In many ways, this approach is similar to the transliteration
based data entry except that we are not constrained to mapping the vowels
and consonants to any specific keys. Also, in the transliterated input
case, more than one keystroke may be required to form a vowel or a consonant
(e.g., an aspirated consonant or a diphthong).
The Inscript keyboard
layout (the recommended standard for ISCII based systems) follows this
approach though it includes keystrokes for the matras as well. Since the
addition of a matra to form a consonant vowel combination is not uniformly
applicable to all cases (in Tamil and Malayalam, the combination with the
vowel "u" changes the shape of the consonant), the Inscript keyboard does
not correctly indicate or reflect what would happen when a combination
is input. However it may be assumed that the key for a matra does not always
result in a matra but may change the shape of the consonant. The Inscript
keyboard basically confirms that a phonetic approach to data entry is feasible.
True, the basic requirement here is that the input module must process
each keystroke taking into consideration the previously entered keys and
also check if the conjunct is valid or meaningful. But this is a module
that can be written once and incorporated into an application program,
to work uniformly across all the Indian languages.
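A minimal sketch of such an input module is given below. The key assignments are hypothetical (they are neither the Inscript layout nor the mapping of any particular package), only a few sounds are covered, and the output is written as Unicode Devanagari. Each key stands for one vowel or one consonant, and the module remembers whether the previous keystroke was a consonant in order to decide between a matra and an independent vowel. Checking the validity of conjuncts is left out.

```python
# Hypothetical one-key-per-sound layout, for illustration only.
CONSONANT_KEYS = {"k": "\u0915", "K": "\u0916", "m": "\u092e", "r": "\u0930"}
VOWEL_KEYS = {"a": "\u0905", "A": "\u0906", "i": "\u0907", "u": "\u0909"}
# Matra written when the vowel key follows a consonant key.
MATRA_KEYS = {"A": "\u093e", "i": "\u093f", "u": "\u0941"}


class PhoneticInput:
    """Builds the displayed text one keystroke at a time."""

    def __init__(self):
        self.text = []
        self.after_consonant = False  # was the previous key a consonant?

    def press(self, key):
        """Process one keystroke and return the text as it now stands."""
        if key in CONSONANT_KEYS:
            # A consonant key yields the consonant with its inherent vowel.
            self.text.append(CONSONANT_KEYS[key])
            self.after_consonant = True
        elif key in MATRA_KEYS and self.after_consonant:
            # The vowel combines with the consonant just entered.
            self.text.append(MATRA_KEYS[key])
            self.after_consonant = False
        elif key in VOWEL_KEYS:
            self.text.append(VOWEL_KEYS[key])
            self.after_consonant = False
        else:
            self.text.append(key)  # punctuation and the like pass through
            self.after_consonant = False
        return "".join(self.text)


ime = PhoneticInput()
for key in "kAm":
    shown = ime.press(key)
print(shown)
```

Because press() returns the text after every keystroke, an application can refresh the display as the user types, which is what distinguishes this from the batch processing of transliterated input.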
The data entry scheme
recommended for the IIT Madras software essentially follows this approach
with one additional facility. The CTRL key (or an equivalent) is
used to indicate that a combination is required to be effected with the
previously formed akshara and the current input. Thus the user explicitly
indicates that a conjunct will have to be formed. This feature is
helpful in situations where the user attempts to input consonants and vowels
not present in a language. The system will not accept such inputs, thus
providing a safeguard that only valid combinations may be input.
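The behaviour described above may be pictured with the following sketch. It is not the IIT Madras implementation, whose details are not given here. The function name, the representation of a keystroke as a (sound, ctrl) pair and the sample sound set are all assumptions made for the illustration. Aksharas are kept as lists of sound names so that the logic stays independent of any script or font.

```python
def form_aksharas(keystrokes, language_sounds):
    """Group keystrokes into aksharas.

    `keystrokes` is a list of (sound, ctrl) pairs, where `ctrl` is True when
    the CTRL key was held down. `language_sounds` is the set of sounds that
    exist in the selected language.
    """
    aksharas = []
    for sound, ctrl in keystrokes:
        if sound not in language_sounds:
            # The safeguard: sounds foreign to the language are refused.
            raise ValueError(sound + " is not available in this language")
        if ctrl and aksharas:
            aksharas[-1].append(sound)  # combine with the previous akshara
        else:
            aksharas.append([sound])    # start a new akshara
    return aksharas


# A made-up sound set for a language without aspirated consonants.
SOUNDS = {"ka", "sa", "ta", "ra", "ma"}

typed = [("sa", False), ("ta", True), ("ra", True), ("ma", False)]
print(form_aksharas(typed, SOUNDS))
```

Here the second and third keystrokes carry the CTRL flag and are therefore merged into the first akshara, while an attempt to enter a sound such as "kha" with this sound set would raise an error instead of producing a meaningless combination.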
In the phonetic approach,
the key mappings do not relate to the generic consonants i.e., a consonant
without any vowel. The mapping relates to the form of the consonant where
the first vowel "ah" is assumed to be present. This is often the
way the consonants are taught to children. This way, only one keystroke
will be required to enter the most frequently required form of each consonant,
as opposed to the case with transliteration based data entry where two
keystrokes will be needed.