General Introduction to Indian Languages
A great many languages are spoken in India. Most of them are related to one
of the officially recognized languages, of which about eighteen are identified
for regular use in the country. All these languages have a phonetic base,
though their writing systems vary. Some of the languages share a common
script and some have scripts of their own. Nine scripts, together with
the scripts for Urdu and Sindhi, constitute the
basic scripts of India. The eighteen languages mentioned above have
been given the status of official languages by the Government. Though the
use of a language may appear to be confined to a region within the
country, the mother tongue of many persons living in that region may be
quite a different one, traceable to early migrations of families.
All the recognized languages (usually referred to as the regional languages)
have a phonetic base. A substantial set of words is common
to many of these languages, and the roots of these words may be traced to
specific languages such as Sanskrit or Tamil, both of which are considered
very ancient languages.
Linguistic
aspects of Indian languages have always attracted scholars from different
parts of the world on account of the hoary past of the languages as well
as their unique phonetic base. Another interesting aspect of Indian
languages is the fact that language was a means not merely for communication
in daily life but also for expressing religious, philosophical, scientific
and professional concepts in amazingly compact ways.
Computers and Indian Languages
It is not surprising, therefore, to find renewed interest today in studying and
understanding many of India's ancient literary works. With the possibility
of using computers for linguistic studies and with the increasing
demand to disseminate information in the vernacular, computing in
Indian languages has gained significance. Though applications such as word
processing and desktop publishing have been successfully implemented
for Indian languages, the solutions remain substantially language specific.
Many of these applications are really useful in practice,
in spite of the effort needed for data entry. Yet very little
seems to have been done towards processing the information electronically.
Viewed nationally, there is an urgent need to provide a uniform and meaningful
software solution to computing in Indian languages.
The phonetic nature of the languages leads to a writing system which
represents sounds through unique symbols. Each language has its own
representation for the sounds and thus its own script, though, as mentioned
earlier, some languages share a common script. In practice, there are small
variations in the scripts that probably matter when linguistic aspects
are considered.
Writing Systems
The writing systems for most Indian languages employ symbols for about
sixteen vowels and as many as thirty-five consonants. Syllables which are
formed from these basic sounds are also given unique representations. The term
conjunct is used to refer to a syllable formed from one or more consonants
and a vowel. Though one can theoretically think of thousands of conjuncts,
only about 800 of them are known to be in regular use, and each of
these can combine with a vowel to make nearly 13,000 individual
sounds, each with its own unique representation in the script.
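The counts quoted above can be checked with a little arithmetic. The sketch below uses the approximate figures from the text. The "theoretical" count rests on an assumption that is not in the original, namely that a conjunct is any run of two or three consonants, so it is an order-of-magnitude illustration only.

```python
# Approximate figures quoted in the text above.
CONSONANTS = 35
VOWELS = 16
CONJUNCTS_IN_USE = 800

# Assumption for illustration: a "theoretical" conjunct is any run of
# two or three consonants.
theoretical_conjuncts = CONSONANTS**2 + CONSONANTS**3

# Each conjunct in regular use can combine with any one of the vowels.
conjunct_vowel_forms = CONJUNCTS_IN_USE * VOWELS

print(theoretical_conjuncts)  # 44100
print(conjunct_vowel_forms)   # 12800
```

The second figure, 12,800, is the "nearly 13,000" mentioned in the text.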
Interestingly, the writing systems employ only about 200 symbols to form the
unique shapes representing the conjuncts, combining shapes somewhat
in the manner of adding ligatures. For each language,
well defined rules exist for writing most of the conjuncts and their combinations
with the vowels. The term Akshara is normally used to refer to a
consonant or a vowel or a simple combination of a consonant and a vowel.
The term Samyuktakshara is used to refer to conjuncts.
Handling Indian languages
on the computer is complicated by the requirement that each and every
one of these aksharas or samyuktaksharas be individually recognized.
Though only a few hundred primitive shapes may be employed in practice
to form the combinations, the large number of aksharas must necessarily
be identified individually for linguistic or text processing purposes.
Children in India are taught to identify thousands of aksharas and once
they have mastered reading the script, they find learning other languages,
including European languages, relatively easy.
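To make the idea of recognizing each akshara individually concrete, here is a minimal sketch that groups a string of Devanagari characters into aksharas. It assumes Unicode code points rather than any coding scheme discussed in this document, and the function name `split_aksharas` and the grouping rule are illustrative simplifications: a consonant joins the previous group only when that group ends in the halant (virama), and vowel signs and other combining marks always join the previous group.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari halant, which joins consonants into a conjunct


def is_consonant(ch):
    # Basic Devanagari consonants KA..HA in Unicode.
    return "\u0915" <= ch <= "\u0939"


def split_aksharas(text):
    """Group code points into aksharas using a simplified rule."""
    aksharas = []
    for ch in text:
        joins_previous = bool(aksharas) and (
            # Vowel signs, halant, anusvara and similar are combining marks.
            unicodedata.category(ch).startswith("M")
            # A consonant right after a halant continues the conjunct.
            or (is_consonant(ch) and aksharas[-1].endswith(VIRAMA))
        )
        if joins_previous:
            aksharas[-1] += ch
        else:
            aksharas.append(ch)
    return aksharas


# The word "Sanskrit" in Devanagari: sa + anusvara, s + halant + k + vowel sign, ta
word = "\u0938\u0902\u0938\u094D\u0915\u0943\u0924"
parts = split_aksharas(word)
print([len(p) for p in parts])  # [2, 4, 1]
```

Seven stored characters yield three aksharas here, and the middle one is a samyuktakshara. A production segmenter would need further rules, for example for explicit halant forms and for the other scripts.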
The methods which work
well for a limited set of twenty-six letters in the Roman alphabet
fail or become inadequate when applied to Indian languages, not
only because there are thousands of aksharas but also because there
is more than one accepted way of writing many of the combinations.
Though there are clear rules for writing the combinations, existing
practices permit multiple representations for the same conjunct, even within
a language, not to speak of variations across the languages.
Codes for the Aksharas
Thus there is a need
to look at the problem of representing (coding) the large set of aksharas
so as to arrive at a standard that can apply uniformly across all
the Indian languages. Electronic text processing can then be attempted
using these codes.
The pioneering work
which resulted in the GIST technology at the Centre for Development
of Advanced Computing (C-DAC) must be regarded as the earliest of the attempts
towards some standardization. This development permitted DOS-based
applications to handle Indian language text. The text was represented
electronically using the ISCII code, a representation that was
largely language independent and thus permitted a uniform approach
to dealing with the languages. Over the years, this hardware-dependent
approach has been replaced by quality word processing and
data preparation software, but the essential eight-bit coding of the characters
has been retained. As will be explained in the section on Character
encoding for Indian languages, eight-bit codes are not really
suitable for efficient string processing.
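One way to see the difficulty is that an akshara occupies a varying number of codes when each code stands for a single consonant, vowel sign or halant. The sketch below uses Unicode code points as a stand-in for ISCII bytes, which is an assumption made for illustration (Python has no built-in ISCII codec, and the Unicode Devanagari layout follows the same consonant, halant and vowel-sign model). The dictionary keys are hypothetical romanized labels.

```python
# One akshara per entry, written as Unicode code points
# (a stand-in for ISCII bytes, assumed for illustration).
aksharas = {
    "ka": "\u0915",                                  # क
    "ki": "\u0915\u093F",                            # कि
    "kshi": "\u0915\u094D\u0937\u093F",              # क्षि
    "stri": "\u0938\u094D\u0924\u094D\u0930\u0940",  # स्त्री
}

# Number of codes needed for each single akshara.
lengths = {name: len(codes) for name, codes in aksharas.items()}
print(lengths)  # {'ka': 1, 'ki': 2, 'kshi': 4, 'stri': 6}

# Fixed-width indexing does not respect akshara boundaries:
# the first two codes of this word are only part of its first akshara.
word = aksharas["kshi"] + aksharas["ka"]
first_two = word[:2]
print(first_two == aksharas["kshi"])  # False
```

Because the akshara boundaries are not visible from code positions alone, operations such as counting, indexing or comparing aksharas need an extra segmentation pass over the codes.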
All the official languages
of the country are written using scripts specific to each language.
Scripts denote the writing systems employed by the languages to represent
the sounds which form the phonetic base of the languages. Currently, the
following language-specific scripts are considered essential:
Devanagari, Gurmukhi,
Bengali, Gujarati, Oriya, Telugu, Tamil, Kannada and Malayalam. The
scripts for Urdu and Sindhi should also be included in this list, though
Devanagari is often used for writing Sindhi.