IITM Syllable Encoding applied to Arabic, Urdu and Hebrew.
As seen before, the
IITM encoding scheme operates at the level of the syllable. Each code is
expressed as a triplet whose three components map onto three independent
fields, together making up 15 bits. Hence any writing system which is syllabic
in nature can be handled well with this encoding scheme.
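The triplet-to-bits mapping can be sketched as follows. The field widths (6 bit consonant, 5 bit intermediate field, 4 bit vowel) are the ones quoted later in this section. The bit order chosen here (consonant in the high bits, vowel in the low bits) and the function names `pack` and `unpack` are assumptions made for illustration only, not the documented IITM layout.

```python
# Field widths as stated in the text: 6 + 5 + 4 = 15 bits.
CONS_BITS, MID_BITS, VOWEL_BITS = 6, 5, 4


def pack(consonant, middle, vowel):
    """Pack a (consonant, middle, vowel) triplet into one 15-bit integer.

    ASSUMPTION: consonant occupies the high bits and vowel the low bits.
    """
    for value, bits in ((consonant, CONS_BITS), (middle, MID_BITS), (vowel, VOWEL_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field value out of range")
    return (consonant << (MID_BITS + VOWEL_BITS)) | (middle << VOWEL_BITS) | vowel


def unpack(code):
    """Split a 15-bit code back into its (consonant, middle, vowel) triplet."""
    vowel = code & ((1 << VOWEL_BITS) - 1)
    middle = (code >> VOWEL_BITS) & ((1 << MID_BITS) - 1)
    consonant = (code >> (MID_BITS + VOWEL_BITS)) & ((1 << CONS_BITS) - 1)
    return consonant, middle, vowel
```

With this layout the largest possible code, `pack(63, 31, 15)`, is `0x7FFF`, which confirms that the three fields fill exactly 15 bits.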
Arabic, Urdu, Hebrew
etc., employ writing systems which are syllabic in nature. The associated
scripts are written right to left. These scripts (specifically Arabic and
Hebrew) are characterized by some unique aspects of their writing systems,
viz., the absence of specific forms for vowels. So one sees only the consonants.
The vowels are shown through a carrier symbol though it is not strictly
correct to call them vowel representations. In Arabic the carrier is the
familiar "Alif". In Hebrew, the carrier is the "Alef". The medial vowel
representations are usually in the form of short strokes in Arabic and
a set of dots arranged in different patterns (called points) in Hebrew.
These points or strokes usually appear above or below the consonants. The
long vowels are usually represented by adding a consonant:
"ya" for "ie" and "va" for "ouh".
The most significant
aspect of the Arabic or Urdu writing system is that each consonant is written
with different shapes depending on whether the consonant appears standalone,
at the beginning, in the middle or at the end of a word. Thus, there are
four possible shapes for a consonant though about six of the consonants
have only two. The syllables are generally connected together in a continuous
fashion except when these six occur. A syllable is invariably a simple
consonant vowel combination and may admit of only consonant doubling. There
are no conjuncts (similar to the Samyuktakshar).
The figure below illustrates
a line of text printed in Arabic. One remarkable aspect of the writing
is the continuity of the strokes from syllable to syllable (a calligrapher's
delight), but such continuity can pose difficulties for the reader. It is
perhaps for this reason that some of the consonants have the property of
breaking the continuity so that the writing is not just one single stroke
from the beginning to the end.
It was stated that
the consonants have four possible shapes depending on their position within
a word.
A standalone consonant is
rarely seen in normal writing and is used only to show its basic shape.
A consonant at the beginning
of a word has continuity only with the syllable which follows. That is,
it connects on the left.
A consonant in the middle
connects with the preceding syllable as well as the one which follows and
hence connects on the right as well as the left.
A consonant at the end of
a word connects only with the preceding one.
The six non connecting
consonants mentioned above connect only with the preceding consonants.
They do not connect with the consonant that may follow. In other words,
these connect only to the right. We now see why these provide the breaks
in writing. When a consonant follows one of these six, it is always written
as if it begins a new word even though it is in the middle of a word.
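The joining rules described above can be sketched as a small routine that picks one of the four shapes for every letter of a word. This is a minimal illustration rather than the editor's actual state machine. The function name `select_forms` and the input convention (one boolean per letter, in reading order, telling whether that letter can connect to the letter that follows it) are assumptions introduced here.

```python
def select_forms(connects_to_next):
    """Return the shape name for each letter of one word.

    connects_to_next[i] is True when letter i is able to join the letter
    that follows it, and False for the six non-connecting consonants.
    Letters are given in reading order (first letter of the word first).
    """
    forms = []
    last = len(connects_to_next) - 1
    for i, can_join_next in enumerate(connects_to_next):
        # A letter is joined to its predecessor only if that predecessor
        # is able to connect forward.
        joined_to_prev = i > 0 and connects_to_next[i - 1]
        # It joins its successor only if one exists and it can connect.
        joins_next = i < last and can_join_next
        if joined_to_prev and joins_next:
            forms.append("middle")
        elif joined_to_prev:
            forms.append("final")
        elif joins_next:
            forms.append("beginning")
        else:
            forms.append("standalone")
    return forms
```

For a four letter word whose second letter is non-connecting, `select_forms([True, False, True, True])` gives `['beginning', 'final', 'beginning', 'final']`: the letter after the non-connecting consonant is written as if it began a new word.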
The IITM encoding
scheme is easily adapted for Arabic and Urdu using the 6 bit consonant
field and the four bit vowel field. Since conjuncts are absent, the five
bit intermediate field of the IITM code would otherwise be empty. This five bit
field is therefore used to indicate which one of the four forms the consonant
should take. Only two of its five bits are needed for this purpose.
This approach is also applicable
to the special consonants which connect on only one side. The middle form
and the final form may be treated as identical in their case.
In the encoding for
Arabic and Urdu, each of the four possible shapes for a consonant is treated
as an individual syllable, though representing the same linguistic content.
Thus, for each base consonant, four different syllable representations
will be present. While displaying the consonant, the appropriate form is
displayed by selecting the relevant code value. The two bit form value
in the conjunct field is automatically inserted by the Arabic or Urdu
specific state machine which handles the keyboard input.
The assignment within the
five bit field is as follows.
00 - The middle form of the consonant
01 - The beginning form of the consonant
10 - The final form of the consonant
11 - The standalone form of the consonant
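The two bit assignment listed above can be written as a lookup table. The sketch below assumes that the form code sits in the two low-order bits of the five bit field; the text does not say which two bits are used, so that placement, like the helper names `set_form` and `get_form`, is hypothetical.

```python
# Form codes exactly as listed in the text.
FORM_CODES = {
    "middle": 0b00,
    "beginning": 0b01,
    "final": 0b10,
    "standalone": 0b11,
}
FORM_NAMES = {code: name for name, code in FORM_CODES.items()}

FIELD_MASK = 0b11111  # the five bit intermediate field
FORM_MASK = 0b00011   # ASSUMPTION: form code kept in the two low bits


def set_form(field, form):
    """Return the five bit field with its form bits replaced."""
    return (field & FIELD_MASK & ~FORM_MASK) | FORM_CODES[form]


def get_form(field):
    """Read the form name back out of the five bit field."""
    return FORM_NAMES[field & FORM_MASK]
```

For example, `set_form(0, "final")` yields `0b00010`, and `get_form(0b00011)` returns `"standalone"`.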
This arrangement allows
us to retain the approach to representing consonant vowel combinations
while maintaining the display requirements. There is however the situation
which requires the use of the double consonant. The writing system uses
a specific mark above a consonant to indicate the doubling. In the case
of Indian languages, consonant doubling is treated as a conjunct and the
keyboard input method allows us to combine a consonant with itself. Since
the five bit field is used for a slightly different purpose in Arabic and
Urdu, it will be necessary to think of some means to generate double consonants.
In the current implementation
of the Arabic editor, consonant doubling has also been handled with a vowel
mark. This is linguistically incorrect. There are three more bits available
in the five bit field for additional syllables and so in principle, one
can handle consonant doubling easily. This has not been done in the present
editor.
If the input state machine
is modified, perhaps the five bit field could be specified differently
and one more bit assigned for consonant doubling. This should be tried
and the next version of the editor will probably incorporate the modification.
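One way the proposed modification might look is sketched below: one of the three unused bits of the five bit field is set aside as a doubling flag, next to the two form bits. The choice of bit (the third lowest) and the names `DOUBLED_BIT`, `make_field` and `is_doubled` are purely hypothetical, since the text states that this has not been implemented.

```python
FORM_MASK = 0b00011    # two bits holding the consonant form (assumed low bits)
DOUBLED_BIT = 0b00100  # HYPOTHETICAL: one spare bit used as a doubling flag


def make_field(form_code, doubled=False):
    """Build the five bit field from a form code (0..3) and a doubling flag."""
    if not 0 <= form_code <= FORM_MASK:
        raise ValueError("form code must fit in two bits")
    return form_code | (DOUBLED_BIT if doubled else 0)


def is_doubled(field):
    """True when the doubling flag is set in the five bit field."""
    return bool(field & DOUBLED_BIT)


def form_of(field):
    """Extract the two bit form code, ignoring the doubling flag."""
    return field & FORM_MASK
```

Keeping the flag separate from the form bits means the display logic can still select the consonant shape from `form_of(field)` and add the doubling mark only when `is_doubled(field)` is true, so doubling no longer has to be encoded as a vowel mark.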
Please observe that
for linguistic purposes, the encoding is still very appropriate. The consonant
in the syllable is identified directly. So also the vowel. In modern Arabic
or Urdu writing, the vowel marks are not normally shown. This may also
be accommodated in the scheme by simply using a different lookup table.
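The idea of switching lookup tables can be illustrated with a toy renderer. All codes and glyph names below are invented placeholders, not actual IITM code assignments; the only point is that the stored syllables stay the same while the vowel table decides whether marks are displayed.

```python
# HYPOTHETICAL codes and glyph names, for illustration only.
CONSONANT_GLYPHS = {1: "ba", 2: "ta"}
VOWEL_MARKS = {0: "", 1: "+fatha", 2: "+kasra"}
# Table for modern unvocalized text: every vowel code maps to no mark.
NO_VOWEL_MARKS = {code: "" for code in VOWEL_MARKS}


def render(syllables, vowel_table):
    """Map (consonant, vowel) code pairs to glyph names using the given table."""
    return [CONSONANT_GLYPHS[c] + vowel_table[v] for c, v in syllables]


text = [(1, 1), (2, 2)]
vocalized = render(text, VOWEL_MARKS)       # marks shown
unvocalized = render(text, NO_VOWEL_MARKS)  # same codes, marks suppressed
```

The encoded text still carries the vowel of every syllable, so linguistic processing is unaffected by the choice of display table.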
The Hamza.
In Arabic and Urdu,
the Hamza represents the Glottal stop and is linguistically treated as
a consonant. However, a carrier symbol is always associated with the Hamza
depending on its position in a word and the vowel it goes with. The Hamza
can appear with the "Alif", the medial forms of some consonants or by itself.
The different forms relate to the glottal stop combined with different
vowels. The conventions adopted for the use of the Hamza seem to be context
specific, and the encoding may need to handle each case explicitly. The
coding scheme should permit the context to be discerned so that linguistic
processing may proceed properly.
Numerals and punctuation.
The numerals in Arabic
and Urdu are written left to right following the normal convention in English.
The punctuation marks unique to Arabic and Urdu are handled in a fairly
straightforward fashion, exactly as in the scheme for Indian scripts.
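The mixed direction of letters and numerals can be sketched as a display-ordering step. This is a simplified illustration, not the editor's algorithm and not a full bidirectional layout procedure: it assumes the text is stored in reading order as single-character tokens, and the function name `visual_order` is hypothetical.

```python
def visual_order(logical):
    """Return tokens in left-to-right screen order for a right-to-left line.

    The whole line is reversed (the first token read ends up rightmost),
    then each run of digits is reversed back so numbers read left to right.
    """
    visual = list(reversed(logical))
    i = 0
    while i < len(visual):
        if visual[i].isdigit():
            j = i
            while j < len(visual) and visual[j].isdigit():
                j += 1
            # Restore the left-to-right order of this run of digits.
            visual[i:j] = reversed(visual[i:j])
            i = j
        else:
            i += 1
    return visual
```

For the stored sequence `['A', 'B', '1', '2', 'C']`, where the letters stand in for right-to-left characters, the result is `['C', '1', '2', 'B', 'A']`: the letters run right to left while the number 12 still reads left to right.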
A note on lexical ordering of the letters
The lexical ordering
of the consonants is different from the ordering specified for Indian languages.
Arabic and Urdu have many consonants whose sounds do not correspond to
those of Sanskrit. Since there is enough code space for the consonants,
one need not be unduly concerned with this issue. Lexical ordering for
Hebrew also differs. It is quite easy to assign codes on the basis of the
conventional ordering for these languages but transliteration across the
scripts, specifically into Roman Diacritics, will involve additional
processing.