Indexing text and generating concordances
The syllable-level coding scheme used in the IITM software lends itself
to direct use with the algorithms commonly used for indexing text. Text is
usually indexed by hashing methods that include some form of collision
handling.
Standard indexing libraries such as GNU dbm can be used to index the
local language text created by the Multilingual Editor or any other
appropriate application. The indexing application breaks the text into
words and discards those that appear in a specified list of words to be
ignored. The sixteen-bit codes in each word are converted into a
three-byte ASCII representation before the word is indexed, and the
reverse process recovers the original syllable-based representation when
matches are retrieved.
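The conversion between sixteen-bit codes and three ASCII bytes could be
sketched as follows. The actual mapping used by the IITM software is not
described here, so this sketch assumes that each sixteen-bit code becomes
three printable characters carrying 4, 6 and 6 bits respectively. The
offset BASE and the function names are hypothetical.

```python
# Hypothetical mapping: one 16-bit syllable code <-> three printable
# ASCII characters. The real IITM encoding may differ.

BASE = 0x30  # assumed offset; keeps every character in the range '0'..'o'


def encode_code(code: int) -> str:
    """Return three ASCII characters for one 16-bit code."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("code must fit in sixteen bits")
    groups = ((code >> 12) & 0x0F, (code >> 6) & 0x3F, code & 0x3F)
    return "".join(chr(BASE + group) for group in groups)


def decode_code(text: str) -> int:
    """Reverse encode_code: three ASCII characters back to a 16-bit code."""
    if len(text) != 3:
        raise ValueError("expected exactly three characters")
    high, middle, low = (ord(char) - BASE for char in text)
    return (high << 12) | (middle << 6) | low


def encode_word(codes):
    """Build the ASCII index key for a word given as a list of codes."""
    return "".join(encode_code(code) for code in codes)


def decode_word(key: str):
    """Recover the list of 16-bit codes from an ASCII index key."""
    return [decode_code(key[i:i + 3]) for i in range(0, len(key), 3)]
```

A key produced by encode_word contains only printable ASCII, so it can be
stored in a dbm-style database, and decode_word restores the syllable
codes when a match is retrieved.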
IITM has developed its own indexing software, which can index a set of
files and create a concordance map and a sorted list of words. The front
end for this would be a program that uses the IITM local language library
to interact with the user.
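As an illustration of what a concordance map and a sorted word list
contain, here is a minimal sketch. It is not the IITM implementation: it
assumes each file has already been split into words, and the names
build_concordance and sorted_word_list are hypothetical.

```python
from collections import defaultdict


def build_concordance(documents, ignored_words=frozenset()):
    """Map each word to the (file name, word position) pairs where it occurs.

    documents: dict of file name -> list of words in reading order.
    ignored_words: words to leave out of the index.
    """
    concordance = defaultdict(list)
    for name, words in documents.items():
        for position, word in enumerate(words):
            if word not in ignored_words:
                concordance[word].append((name, position))
    return dict(concordance)


def sorted_word_list(concordance):
    """Return the indexed words in sorted order.

    Sorting here is by plain string comparison, which is an assumption;
    a collation suited to the language may be needed in practice.
    """
    return sorted(concordance)
```

Looking up a word in the resulting map gives every place where it occurs,
which is the information a search front end needs in order to display
matches in context.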
Alternatively, Swish-E, a popular application used extensively for
indexing on the web, may be used instead.
The indexing software developed at IITM works as follows.
1. Create the required local language files and organize them into a
   meaningful directory structure. The IITM Multilingual Editor can be
   used for this purpose, or conversion utilities can be used to convert
   Indian language text in other formats into the .llf form.
2. Create a list containing the pathname of each file. This can be
   obtained by taking a recursive listing of the directory and retaining
   only the pathnames.
3. Run the indexing program, specifying the list on the command line.
4. Run the utility program to generate concordance information for each
   word.
5. Additionally, run other utilities to generate word lists and sort them.
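The file list described in the steps above could be produced with a short
script such as the following. This is only a sketch: the function names
and the one-pathname-per-line list format are assumptions, and the IITM
indexing program may expect something different.

```python
import os


def list_llf_files(root, extension=".llf"):
    """Walk root recursively and return the pathnames of matching files."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for filename in sorted(filenames):
            if filename.lower().endswith(extension):
                paths.append(os.path.join(dirpath, filename))
    return paths


def write_file_list(root, list_path):
    """Write one pathname per line to list_path and return the pathnames."""
    paths = list_llf_files(root)
    with open(list_path, "w", encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths
```

The file written by write_file_list would then be the list given to the
indexing program on its command line.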
The search applications hosted at this site (Bhagavadgita and Tirukkural)
were built using the above steps.