The Syllable level coding scheme
Basic principles underlying the encoding scheme incorporated into the IITM
Software are discussed in this document.
Introduction
The multilingual software
developed at IIT Madras employs a syllable level coding scheme for representing
text in Indian languages. This choice is based on linguistic considerations
as well as the fact that the writing systems are syllabic in nature, i.e.,
text written in any of the Indian Scripts follows the rules for writing
syllables and not the basic letters constituting the consonants and vowels.
The IITM scheme is
uniform across all the languages/scripts of India. A superset of basic
Aksharas has been identified and codes assigned for them. In simple terms,
an Akshara is a syllable and may consist of just a vowel, a consonant,
a consonant vowel combination or in general a combination of two or more
consonants and a vowel. Approximately 12000 Aksharas, found to be in use
across the languages, have been identified for this set. While it may be
argued that this set is restrictive, the coding scheme does allow arbitrary
syllables to be represented using the generic forms of consonants, consistent
with linguistic requirements.
Each syllable is represented through a 16 bit fixed length code. This scheme
also provides for representations along the lines of variable length codes
for syllables as in ISCII or Unicode. However, since every syllable occupies
exactly two bytes, the fixed length code makes text processing much simpler
than it is with the variable length schemes.
The Syllable Level Coding Scheme is a natural choice for languages which
employ a syllabic writing system. IIT Madras has proposed this coding scheme
as a single unifying method for computing with all the Indian languages
including Urdu. The scheme is open to question however and has been termed
proprietary by groups developing applications on the basis of Unicode or
ISCII.
The following sections
present the list of consonants and vowels which can form syllables. This
is a superset of the basic Aksharas from all the languages. Devanagari
script is used for illustration but where required, other scripts are also
used.
Each consonant and
vowel is assigned an individual number corresponding to its lexical ordering.
This numbering scheme is meant to identify consonants and vowels within
the set of the basic Aksharas. The development team had arrived at this
numbering after examining many issues relating to linguistic conformity.
In the Table below,
each Akshara has also been assigned a name (an English name) that will
be used to refer to the Akshara in a computer program (similar to a name
assigned to a literal).
Vowels
Sixteen vowels are included.
The last one in the list is the "null" vowel. This definition allows us
to treat a generic consonant as a syllable with a null vowel for purposes
of text processing. The null vowel plays the role of the "halanth".
Vowel 15, as mentioned above, should be viewed as a "null" vowel. A consonant
combining with the "null" vowel is equivalent to its generic form. In the
IITM scheme, a consonant is always viewed as one with the first vowel "ah"
as part of it. This is not strictly correct from a linguistic point of view,
since the generic consonant is defined as one without a vowel. However, it
is not easy to pronounce a generic consonant by itself; hence the convention
that the consonant is treated as a syllable with the first vowel "ah". The
generic consonant is also viewed as a syllable when the null vowel is part
of it. This is a convenient representation. When writing a syllable, the
rules always permit writing it as a series of generic consonants except for
the last consonant in the syllable. This will become clear later.
Consonants
In Indian languages, consonants
are grouped into sets based on the physiological basis for the production
of the sound the consonant stands for. There are basically seven groups.
Consonants 39-42 apply in the case of Southern languages, though "lla" is
also seen in Sanskrit, Gujarati etc. "nas" is specific to Tamil.
"ksha" (38) is actually
a conjunct but due to its high frequency of use, it has been assigned a
value. In the IITM system, this is treated as a consonant but handled by
special rules when lexicographic ordering is effected.
Special consonants
Apart from these 42,
four additional consonants have also been defined. These are not consonants
per se but viewed as pseudo consonants which can form syllables which may
correspond not to a sound but a shape. In all the writing systems, special
symbols are present but these do not have linguistic value. They often
represent notation used in accountancy, poetry, Vedic texts etc.
These four additional consonants are named "visarg" (43), "music" (44),
"vedic" (45) and "null" (46).
The actual use of
these consonants will become apparent later.
In addition to the
vowels and consonants, the scheme provides codes for 16 special marks which
are basically punctuation symbols. Two of the sixteen are used as "Anuswars"
and one is reserved for the "Avagraha" symbol. These 16 are reckoned as
special syllables formed by an imaginary consonant with the 16 vowels.
This imaginary consonant has been assigned the value 63.
Ten numerals have
also been assigned special codes. These distinguish the local numerals
from their ASCII equivalents. The ten numerals are viewed as special syllables
using the imaginary consonant mentioned above.
All the symbols in the 96 character displayable ASCII set have also been
assigned codes. Each Roman letter, punctuation mark or special character is
viewed as a syllable involving an imaginary "Roman consonant". This imaginary
Roman consonant has been assigned the value 62.
Thus the basic set of aksharas supported in the IITM scheme consists of 16
vowels, 42 basic consonants, 4 special consonants and two imaginary
consonants, one for Roman letters and the other for special symbols.
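The consonant numbers described so far may be summarised in a short sketch.
This is only an illustration in Python; the function name and the category
labels are hypothetical and are not part of the IITM software. Values which
the text does not describe (0 and 47 to 61) are simply reported as
unspecified.

```python
# Names of the four special consonants, as listed in the text.
SPECIAL_CONSONANTS = {43: "visarg", 44: "music", 45: "vedic", 46: "null"}

ROMAN_CONSONANT = 62   # imaginary consonant for displayable ASCII
SYMBOL_CONSONANT = 63  # imaginary consonant for special marks and numerals


def consonant_class(c):
    """Classify a 6 bit consonant value (hypothetical helper)."""
    if not 0 <= c <= 63:
        raise ValueError("consonant value must fit in 6 bits")
    if 1 <= c <= 42:
        return "basic"
    if c in SPECIAL_CONSONANTS:
        return SPECIAL_CONSONANTS[c]
    if c == ROMAN_CONSONANT:
        return "roman"
    if c == SYMBOL_CONSONANT:
        return "symbol"
    return "unspecified"  # not described in this document
```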
Representing Syllables
The form of a syllable can be any one of the following.
V - a pure vowel
C - a pure consonant (generic consonant with "ah")
CV - a consonant vowel combination
CCV - two consonant conjuncts
CCCV - three consonant conjuncts
The IITM scheme caters
to all possible V, C and CV combinations and select combinations for the
CCV and CCCV forms. About 800 of these have been defined after studying
the syllables in use across all the Indian languages.
In the scheme, for each base consonant C, at most 31 conjuncts can be
specified, and so the number of distinct conjuncts one can form starting with
any one of the 42 consonants above is limited to 31. This does restrict the
number of syllables one can represent through a single code (2 bytes). In
practice, however, this does not appear to be a problem.
The IITM scheme does
not provide a single code for four consonant conjuncts and above though
many such conjuncts are in use. These have to be handled specially in ways
that also provide linguistic conformity.
The Syllable representation Scheme
Each syllable is represented as a triple (c, cj, v) where c is the base
consonant, cj is the conjunct part consisting of one or more consonants and
v, the vowel. The triple is accommodated in a 15 bit field divided into 6,
5 and 4 bit fields as shown.
1 (MSB) | 6 (Consonant) | 5 (Conjunct) | 4 (Vowel) |
The most significant bit is not a part of the syllable. It indicates whether
the remaining fifteen bits represent a syllable or an attribute/escape
value. For valid displayable syllables, this bit is zero. When it is set to
one, the remaining 15 bits carry additional information about the
language/script to be used in the succeeding syllables.
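The packing of the triple into a 16 bit code may be sketched as follows. This
is a minimal illustration in Python and not the IITM implementation. The
function names are hypothetical, and the bit positions assume that the fields
are laid out in the order shown above, with the consonant in the high order
bits and the vowel in the low order bits.

```python
ESCAPE_BIT = 1 << 15  # MSB: 0 = displayable syllable, 1 = attribute/escape


def pack_syllable(c, cj, v):
    """Pack (consonant, conjunct, vowel) into a 16 bit code with MSB = 0."""
    if not (0 <= c < 64 and 0 <= cj < 32 and 0 <= v < 16):
        raise ValueError("field out of range for the 6/5/4 bit layout")
    return (c << 9) | (cj << 4) | v


def is_syllable(code):
    """True when the MSB is clear, i.e. the code is a displayable syllable."""
    return (code & ESCAPE_BIT) == 0


def unpack_syllable(code):
    """Split a syllable code back into its (c, cj, v) triple."""
    if not is_syllable(code):
        raise ValueError("escape bit set: not a syllable")
    return (code >> 9) & 0x3F, (code >> 4) & 0x1F, code & 0x0F
```

For instance, pack_syllable(38, 0, 15) would stand for the generic form of
"ksha" under the assumption that a conjunct value of 0 means "no conjunct".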
The interpretation
of the 6 bit consonant field as well as the four bit vowel field is fairly
obvious. The intermediate 5 bit value needs some explanation.
Codes for Samyuktakshars
For each base consonant
specified, one may list the set of syllables seen in normal use across
the languages. Up to 31 of these are assigned values. For many base consonants,
this set may be quite small, with as few as seven syllables. The specific
set of two and three consonant syllables starting with a base consonant
is lexicographically ordered and a number between 1 and 31 is assigned
to each combination. This process is best illustrated through an example.
Seen below is the
list of conjuncts starting with "ga". It should be kept in mind that the
syllables listed here are the ones for which codes have been assigned.
It is certainly possible that the list is not exhaustive and that other
syllables starting with "ga" have been omitted. The understanding here
is that the ordering conforms to the lexical ordering of the samyuktakshars.

We observe
that the triple directly allows us to see the base consonant as well as
the vowel. Inferring the consonant or the consonants in the conjunct part
requires a look up through a table.
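A table look up of this kind may be sketched as follows. The table contents
are placeholders invented for illustration and are not taken from
generic.cnj; the function name is hypothetical, and the sketch assumes that
a conjunct value of 0 means that the syllable has no conjunct part.

```python
# Placeholder table: for each base consonant, entry i-1 lists the extra
# consonants of conjunct number i. The numbers are invented for this
# example and are NOT the values assigned in the IITM scheme.
CONJUNCT_TABLE = {
    3: [(3,), (20,), (20, 30)],
}


def consonants_in(c, cj):
    """Return all consonants of the syllable (c, cj, v), base consonant first."""
    if cj == 0:  # assumption: 0 means "no conjunct part"
        return (c,)
    entries = CONJUNCT_TABLE.get(c, [])
    if not 1 <= cj <= len(entries):
        raise KeyError(f"no conjunct {cj} defined for base consonant {c}")
    return (c,) + entries[cj - 1]
```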
When performing "Regular Expression Matching", we can gain a lot of
flexibility by masking the conjunct part or the vowel part or both, and
thereby identify strings that sound similar. In other words, the IITM scheme
is very well suited for regular expression matching at the syllable level.
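Masking of this kind may be sketched as follows. This is an illustration
only: the function names are hypothetical, and the masks assume the field
order consonant, conjunct, vowel from the high order bits downwards.

```python
CONJUNCT_MASK = 0x1F << 4   # the 5 bit conjunct field
VOWEL_MASK = 0x0F           # the 4 bit vowel field
SYLLABLE_MASK = 0x7FFF      # all 15 syllable bits
ESCAPE_BIT = 0x8000


def matches(code, pattern, mask):
    """Compare two syllable codes on the bits selected by mask only."""
    return (code & mask) == (pattern & mask)


def find_similar(text, pattern, ignore_conjunct=False, ignore_vowel=False):
    """Return the positions in text (a list of 16 bit codes) that match pattern."""
    mask = SYLLABLE_MASK
    if ignore_conjunct:
        mask &= ~CONJUNCT_MASK
    if ignore_vowel:
        mask &= ~VOWEL_MASK
    return [i for i, code in enumerate(text)
            if not code & ESCAPE_BIT and matches(code, pattern, mask)]
```

Ignoring the vowel finds every syllable built on the same consonant and
conjunct; ignoring both fields finds every syllable with the same base
consonant.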
The full set of conjuncts
supported by the software is specified in a text file referred to as generic.cnj
and the complete set of basic consonants, vowels and the special symbols
are specified in independent files. These files are text files which are
used in generating the syllable level codes.
generic.vow: The set of 16 vowels with their codes and the key stroke
associated with each vowel. The key here refers to the ASCII value of the
Roman letter that should be used for typing in the vowel.
generic.con: The set of 46 consonants (42 which are linguistically
significant, three which are meant for special cases and the null consonant
make up this 46). The structure of this file is similar to that of
generic.vow.
generic.spl: The set of 16 special characters which may be typed in by way
of punctuation marks and special aksharas.
generic.cnj: The listing of conjuncts which have been assigned codes. The
list is presented in the order of the base consonants. For any base consonant
only 31 conjuncts are allowed.
All four of the above files are pure text files (ASCII) and hence can be
modified to suit specific requirements without the need to recompile the
application dealing with syllables. Before use, the four files are converted
into the appropriate data structures, which the applications read in from
external files.
The software does not hard code the keyboard mapping for a vowel, consonant
or a special symbol. It is therefore possible to reassign the keys to suit
specific requirements.
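Loading such a mapping at run time may be sketched as follows. The actual
layout of generic.vow is not described in this document; the three column
layout (code, name, key) and the sample entries below are invented purely
for illustration, as is the function name.

```python
# Invented sample in a hypothetical "code name key" layout; this is NOT
# the real content or format of generic.vow.
SAMPLE = """\
# code name key
0 ah a
1 aa A
"""


def load_keymap(lines):
    """Build a {key: (code, name)} table from lines in the assumed layout."""
    keymap = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        code, name, key = line.split()
        keymap[key] = (int(code), name)
    return keymap


keymap = load_keymap(SAMPLE.splitlines())
```

Since the table is built from text at run time, reassigning a key amounts to
editing one line of the file.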
The recommended keyboard
mapping is shown below. This is based on phonetic Roman equivalents, to
the extent possible with 26 letters and about 16 special symbols and punctuation.
Use of the null consonant
The null consonant
is useful for generating syllables which conform to specific display shapes
without disturbing the linguistic content. A syllable starting with a null
consonant will have the following triplet.
(46, cj, v)
with cj and v taking
their respective range of values. (46,cj,15) will correctly display the
consonant in cj through its half form in Devanagari derived scripts and
the smaller shape that appears below a consonant in the Southern scripts.
The halanth should be specified for a pure half form since the linguistic
equivalent for the half form is a generic consonant. In the Southern scripts,
the equivalent of the half form is the consonant which appears above and
the consonant appearing below will be the one that takes the vowel.
Just as we have generic consonants, we can also have generic syllables, i.e.,
combinations of consonants only. Such a generic syllable can form part of a
full syllable, the full syllable being obtained by adding a vowel. Generic
syllables may be typed in a sequence to form arbitrarily long syllables. With
some care, such sequences can conform to linguistic requirements except when
the writing system changes the order in which the consonants are displayed.
This is seen mostly with "r" and the rules vary widely across the scripts.
The writing system used for Tamil employs only generic consonant shapes for
conjunct aksharas.
Null consonant with a vowel
The null consonant
can take a vowel by itself and this may be used to represent the Matras.
One observes that while Matras by themselves do not have any linguistic
value, the standalone symbol for the Matra is required in practice,
if only to teach the rules of the writing system.
The roles
played by the null consonant and the null vowel are now clear.
The null vowel in a syllable represents a generic consonant or a generic
conjunct depending on the contents of the two fields (the 6 bit consonant
part and the 5 bit conjunct part).
The null consonant in a syllable is a representation of a generic consonant,
specified by the conjunct part. This is a provision made in the IITM coding
scheme for representing alternate display shapes for a generic consonant in
a syllable. Essentially, this may be viewed as a trick to generate suitable
displayed forms for arbitrary syllables while maintaining linguistic content.
In effect, this amounts to composing syllable shapes.
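The two roles may be illustrated with a small classifier over the triple.
This is only a sketch: the function name and the labels are hypothetical, and
it assumes that a conjunct value of 0 means "no conjunct part", which the
numbering of conjuncts from 1 to 31 suggests but the text does not state.

```python
NULL_VOWEL = 15
NULL_CONSONANT = 46


def role(c, cj, v):
    """Describe what the triple (c, cj, v) stands for (hypothetical labels)."""
    if c == NULL_CONSONANT:
        if cj != 0:
            # alternate display shape of the consonant(s) selected by cj
            return "display shape"
        # null consonant taking a vowel by itself: the standalone Matra
        return "matra" if v != NULL_VOWEL else "unspecified"
    if v == NULL_VOWEL:
        return "generic conjunct" if cj != 0 else "generic consonant"
    return "full syllable"
```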
It may be noted that
the set of syllables defined in the IITM software (where each syllable
is coded as exactly two bytes) is already comprehensive. Enough code space
is available for adding some more but this will result in incorrect display
of some aksharas in the text prepared with an earlier set of codes. Only
the Samyuktakshars will be affected however.
Special Consonants Visarg, Music and Vedic
The three special consonants Visarg (code 43), Music (code 44) and Vedic
(code 45) permit the generation of special symbols such as Vedic accent
marks, musical notation etc. Also, the Visarg consonant is required in
practice as a standalone feature to handle syllables that already have a syllable.
The three vowels not included in the main set of vowels (long vocalic r,
vocalic l and its long form) can be typed in as special syllables using
Visarg and the vowels ru, uh and ouh. It is true that the linguistic
structure breaks when a syllable is composed like this, but since the use of
these vowels is quite rare, applications can handle the situation as a
special case. The special consonant Music was included to permit music
notation to be handled by the software. Currently, this consonant is used to
generate 16 special syllables for use with Braille.
Details of these special consonants may be seen in the documents included
with the applications.
Some general observations on the syllable level codes
Only 15 bits are used for each syllable. The sixteenth bit specifies how the
remaining 15 bits should be interpreted. When the 16th bit is set to 1, the
remaining 15 bits specify the script to be associated with the syllables
which follow, until another switch occurs.
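Reading a stream of codes with script switches may be sketched as follows.
The contents of the 15 bits of a switch code are not detailed in this
document, so the sketch treats them as an opaque script identifier; the
function name and the default of None for "no script selected yet" are
assumptions.

```python
ESCAPE_BIT = 0x8000


def decode_stream(codes, default_script=None):
    """Return (script, c, cj, v) for every syllable in a list of 16 bit codes."""
    script = default_script
    syllables = []
    for code in codes:
        if code & ESCAPE_BIT:
            # switch code: remember the script for the syllables that follow
            script = code & 0x7FFF
        else:
            c = (code >> 9) & 0x3F
            cj = (code >> 4) & 0x1F
            v = code & 0x0F
            syllables.append((script, c, cj, v))
    return syllables
```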
By and large, the coding scheme
maintains the correct lexical ordering of the syllables. In fact standard
sorting algorithms may be used without problems.
It should be remembered
that lexicographic ordering is not precisely defined for any Indian language.
There are several opinions on this. In practice it may be necessary to
map each syllable to an appropriately ordered value before sorting. The
algorithm for this is really quite simple since one is dealing with fixed
length codes.
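Such a mapping may be sketched as a sort key. The ranking table used in the
example is invented for illustration and does not reflect the ordering rules
of any language or of the IITM software; the function name is hypothetical,
and the consonant is assumed to occupy the high order bits of the code.

```python
def make_sort_key(consonant_rank):
    """Build a sort key that reorders syllables by a consonant ranking.

    consonant_rank maps a consonant number to its position in the desired
    ordering; consonants not listed keep their own number as the rank.
    """
    def key(code):
        c = (code >> 9) & 0x3F
        rest = code & 0x1FF  # conjunct and vowel fields, order unchanged
        return (consonant_rank.get(c, c), rest)
    return key


# Invented example: place consonant 38 between consonants 1 and 2.
codes = [2 << 9, 38 << 9, 1 << 9]
ordered = sorted(codes, key=make_sort_key({38: 1.5}))
```

Because every code has the same length, the key is computed with two shifts
and masks per syllable and any standard sorting routine can then be applied.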
Though the 15 bits provide for
as many as 32,768 syllables, only about 12000 are meaningful in practice.
The size of text is significantly
reduced in terms of number of bytes stored compared to other schemes involving
ASCII or Unicode. In the following examples, two or more representations
are given, where different transliteration rules are applied.