Linguistic processing (computation) with Indian languages
In the context of computing with Indian languages, the basic quantum of information
to be processed is a syllable. The writing systems of India are based on
syllables. Computation with text in Indian languages is hence a question
of working with syllables. The representation of a syllable in the computer
assumes significance in this context.
Text processing algorithms have generally been written for English, since
most computing has been based on the English language and most of the
information available electronically is in English. These algorithms
process text one character at a time. Text is represented as a string of
characters specified through codes (typically ASCII) for the letters of the
alphabet and special symbols. For example, an algorithm to check whether a
word is a palindrome simply reverses the string and compares it with the
original. The length of a word is the number of characters in the word.
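As a minimal sketch of this character-oriented style (Python is used here
only for illustration, and the function names are not from the original
text), both operations reduce to plain string handling:

```python
def is_palindrome(word: str) -> bool:
    """Return True if the word equals its own character-wise reversal."""
    return word == word[::-1]


def word_length(word: str) -> int:
    """Return the length of the word as a count of characters."""
    return len(word)


if __name__ == "__main__":
    for w in ("level", "syllable"):
        print(w, is_palindrome(w), word_length(w))
```

This works for English because one stored character corresponds to one
letter of the word.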
The approach required for
Indian languages has to be different since all processing has to be done
with syllables. Text in any Indian language is reckoned only in this manner
and syllable identification is critical to determining the linguistic content.
Therefore the algorithm to identify a syllable gains significance.
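One possible way to identify syllables is sketched below for Devanagari
text stored as Unicode code points. This is an illustrative assumption and
not the IITM algorithm: the character classes cover only the core
Devanagari block, a run of consonant-plus-virama pairs followed by a
consonant is treated as one conjunct, and the names SYLLABLE, syllables and
is_syllable_palindrome are hypothetical.

```python
import re
from typing import List

# Simplified Devanagari character classes (assumption: core block only).
CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"  # consonant, optional nukta
VIRAMA = "\u094D"                                  # suppresses the inherent vowel
MATRA = "[\u093E-\u094C\u0962\u0963]"              # dependent vowel signs
MODIFIER = "[\u0901-\u0903]"                       # candrabindu, anusvara, visarga
VOWEL = "[\u0904-\u0914\u0960\u0961]"              # independent vowels

# A syllable is either a consonant cluster with an optional vowel sign
# (or a final virama), or an independent vowel; each may carry modifiers.
# Any other character (space, digit, Latin letter) is returned on its own.
SYLLABLE = re.compile(
    f"(?:{CONSONANT}{VIRAMA})*{CONSONANT}(?:{MATRA}|{VIRAMA})?{MODIFIER}*"
    f"|{VOWEL}{MODIFIER}*"
    f"|.",
    re.DOTALL,
)


def syllables(text: str) -> List[str]:
    """Split text into a list of syllables using the pattern above."""
    return SYLLABLE.findall(text)


def is_syllable_palindrome(word: str) -> bool:
    """Compare the word with its reversal syllable by syllable."""
    units = syllables(word)
    return units == units[::-1]


if __name__ == "__main__":
    word = "\u0928\u092e\u0938\u094d\u0924\u0947"  # namaste
    print(len(word), syllables(word))
```

With this sketch, नमस्ते is six code points but three syllables (न, म, स्ते),
and नवजीवन is a palindrome when compared syllable by syllable even though
its code point sequence is not.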
Regrettably, the approaches to representing text in Indian languages do
not lend themselves to easy implementation of text processing algorithms.
There are virtually no accepted standards for coding schemes, though one
is constantly reminded of ISCII, Unicode or even font-based schemes.
While ISCII and Unicode have been shown to be viable in implementations,
they have fairly serious problems in representing syllables unambiguously.
The pages at this site that discuss these issues in detail amply convey
the problems of using variable-length codes for representing syllables.
Leaving those problems aside, the following are representative of the
computations one would perform from a linguistic point of view.
String processing and pattern matching (regular expressions)
Indexing text and generating concordances
Search applications (including searches on the web)
Database applications (MySQL, SQL, etc.)
Grammatical analysis of text (e.g., morphological analysis)
Parsing text and translation
Taggers and generating linguistic corpora
Frequency of occurrence of syllables
Transliteration across scripts
On-the-fly conversion of text into different formats (images, PDF, etc.)
Text processing applications
available with the IITM Software.
The syllable frequency count application is a particularly useful one.
This specially written application takes care of alternate forms
(linguistically equivalent but differing in appearance) for writing a
syllable. The results of running the application on different texts in
Sanskrit and Tamil can be seen in the linked page.
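The idea of counting syllables while merging alternate written forms can be
sketched as follows. The text does not say which alternate forms the IITM
application handles, so this sketch makes an assumption: two forms are
treated as equivalent if they differ only in the Unicode joiner characters
(ZWJ, ZWNJ) that select a display form, or in Unicode normalization. The
names canonical and syllable_frequencies are hypothetical, and the input is
assumed to be already split into syllables.

```python
import unicodedata
from collections import Counter
from typing import Iterable

# Zero-width non-joiner and zero-width joiner: they change how a conjunct
# is displayed, not which syllable it is (assumption of this sketch).
JOINERS = {"\u200c", "\u200d"}


def canonical(syllable: str) -> str:
    """Map alternate written forms of a syllable to one canonical form."""
    stripped = "".join(ch for ch in syllable if ch not in JOINERS)
    return unicodedata.normalize("NFC", stripped)


def syllable_frequencies(syllables: Iterable[str]) -> Counter:
    """Count syllables, treating alternate forms as the same syllable."""
    return Counter(canonical(s) for s in syllables)


if __name__ == "__main__":
    sample = ["\u0915\u094d\u200d\u0937", "\u0915\u094d\u0937", "\u0928"]
    for syllable, count in syllable_frequencies(sample).most_common():
        print(syllable, count)
```

Here the two spellings of the conjunct क्ष, with and without the joiner, are
counted under a single entry.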
The applications which perform on-the-fly conversion of text into
different formats are very useful for serving content on the web, where
the most appropriate format for the content can be decided before sending
it to the browser.
The "Learn Sanskrit through self study" lessons at this site have become
popular all over the world since they can be viewed on almost any browser.
Here the lessons are sent in the form of images, converted on the fly when
the browser requests a page containing Devanagari text.
Search applications are easy to implement using the software base
developed at IITM. The fixed-size syllable-level code greatly simplifies
string processing. In fact, conventional indexing software such as Swish-E
can be used directly to index local language text prepared with the IITM
editor or equivalent software.
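Why a fixed-size code per syllable simplifies searching can be illustrated
with a small sketch. The actual IITM code assignments are not given here,
so the codebook below is hypothetical: it simply hands out 16-bit integers
in order of first appearance. Once every syllable occupies one fixed-width
slot, pattern search becomes an ordinary comparison of code sequences, and
a match can never start or end in the middle of a syllable.

```python
from array import array
from typing import Dict, Iterable, List


class SyllableCodebook:
    """Assigns each distinct syllable one fixed-width (16-bit) code.

    Hypothetical stand-in for a real syllable-level coding scheme.
    """

    def __init__(self) -> None:
        self._codes: Dict[str, int] = {}

    def encode(self, syllables: Iterable[str]) -> array:
        codes = array("H")  # unsigned 16-bit integers, one per syllable
        for s in syllables:
            # Reuse the existing code, or assign the next free one.
            codes.append(self._codes.setdefault(s, len(self._codes)))
        return codes


def find_all(text: array, pattern: array) -> List[int]:
    """Return the syllable positions at which pattern occurs in text."""
    n = len(pattern)
    if n == 0:
        return []
    return [i for i in range(len(text) - n + 1) if text[i:i + n] == pattern]


if __name__ == "__main__":
    book = SyllableCodebook()
    text = book.encode(["na", "ma", "ste", "na", "ma"])
    pattern = book.encode(["na", "ma"])
    print(find_all(text, pattern))
```

The syllables are written in Latin letters in the usage example only to
keep it readable; any syllable strings would do.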