Rendering a text string
(Approach to displaying the Aksharas from syllable level codes)
The structure of the
syllable level code (in the form of a triplet) allows easy rendering of
the syllable by specifying the sequence of glyphs to be displayed. The
sequence of glyphs depends on the Font used. Since the code for the syllable
is fixed at two bytes, it is possible to employ a simple lookup for getting
the glyphs associated with any syllable supported by the coding scheme.
Almost any eight bit font could be used here if the glyphs are considered
adequate for rendering the text. This method will work properly across
platforms if the support required for rendering glyphs is available, given
the glyph string and the underlying encoding. Typically, a font conforming
to ISO-8859-1 will render properly on most systems. Hence the IITM scheme
recommends the use of such fonts.
The IITM scheme
allows correct rendering of text using the IITM fonts on all the important
platforms: MS Windows, Unix, Macintosh and PostScript devices. It is possible
to correctly view and print the contents of a web page prepared using
the IITM fonts because, in each case, the glyphs are named in conformance
with the native encoding of the corresponding platform. This is essentially
the ISO-8859 scheme with glyph names mapped properly on each platform.
The lookup is
effected through a three dimensional array corresponding to the triplet
representation.
This array is specified through a
file which is read in by the application when it is invoked. This
approach also permits dynamic selection of the lookup since the application
can reinitialize the array as and when required. Thus, the same local
language text may be shown in different fonts, irrespective of the
glyph locations so long as the required set of glyphs is supported. Also,
the fact that only eight bit fonts are needed gives us the freedom to use
freely distributed fonts.
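The lookup and its reinitialization can be sketched in Python as follows. This is only a minimal illustration: the array dimensions, the class name GlyphTable, its methods and the glyph values are assumptions made for the example and are not taken from the IITM software.

```python
from typing import Dict, List, Optional, Tuple

# Assumed dimensions, for illustration only; the real limits come from the
# coding scheme. Index 0 of the conjunct axis is simply left unused here.
N_CONSONANTS, N_CONJUNCTS, N_VOWELS = 64, 32, 16

Triplet = Tuple[int, int, int]


class GlyphTable:
    """Three-dimensional lookup from (consonant, conjunct, vowel) to glyphs."""

    def __init__(self) -> None:
        self.cells: List[List[List[Optional[bytes]]]] = []
        self.load({})

    def load(self, entries: Dict[Triplet, bytes]) -> None:
        """(Re)initialize the array, e.g. when another font is selected."""
        self.cells = [
            [[None] * N_VOWELS for _ in range(N_CONJUNCTS)]
            for _ in range(N_CONSONANTS)
        ]
        for (c, cnj, v), glyphs in entries.items():
            self.cells[c][cnj][v] = glyphs

    def lookup(self, c: int, cnj: int, v: int) -> Optional[bytes]:
        """Return the glyph string of a syllable, or None if there is no entry."""
        return self.cells[c][cnj][v]


table = GlyphTable()
table.load({(1, 1, 1): bytes([0x6B])})          # made-up entry for one font
glyphs_a = table.lookup(1, 1, 1)
table.load({(1, 1, 1): bytes([0xC4, 0x81])})    # same syllable, another font
glyphs_b = table.lookup(1, 1, 1)
```

The stored text never changes in this sketch; only the table contents do, which is what allows the same syllable to be shown with a different font after a reload.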
The 15 bit syllable
code allows us to represent as many as 32,768 (2^15) aksharas. That many
are not required in practice. The overall display process is illustrated
below. Please remember that the actual code is 16 bits wide. The most significant
bit is not used as part of the code but serves as an indicator when the
16 bits contain information that should be interpreted rather than rendered.
Such a situation arises when a change of script is specified.
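A small Python sketch of how an application might test the indicator bit is given here. It assumes that the indicator is signalled by the most significant bit being set; the names classify, CONTROL_FLAG and CODE_MASK are hypothetical, and the meaning of the remaining 15 bits in a control word is not specified here.

```python
from typing import Tuple

CONTROL_FLAG = 0x8000   # most significant bit of the 16-bit word
CODE_MASK = 0x7FFF      # the remaining 15 bits


def classify(word: int) -> Tuple[str, int]:
    """Split a 16-bit word into its kind and its 15-bit payload.

    Assumption: a set top bit marks information to be interpreted
    (for example a change of script) rather than rendered.
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    kind = "control" if word & CONTROL_FLAG else "syllable"
    return kind, word & CODE_MASK
```

For instance, classify(0x0123) yields a syllable with code 0x0123, while classify(0x8005) yields a control word with payload 5.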
The
application may dynamically select a font for displaying the text and generate
the glyph string by looking up the related table. Each entry in the table
corresponds to an element of the three dimensional array into which
the triplet is mapped. A typical entry in the file which provides the table
(the file is given a .tab extension to identify it as a table) has to conform
to this simple form. (Please refer to the discussion of the IITM syllable
level coding scheme for additional information.)
[c(i),cnj(j),v(k)] = g1g2g3..gn
c(i) - is the code assigned for the base consonant in the syllable.
cnj(j) - is the value of the conjunct part of the syllable, in the range 1-31.
v(k) - is the code assigned to the vowel.
g1g2g3..gn - are the eight bit indexes of the glyphs which must be rendered to display
the akshara. In practice, these glyphs will have the required shapes for
the consonants, matras as well as other ligatures to correctly build the
shape for the akshara to be displayed. It should be observed here that
the set g1g2..gn will have to be individually identified for each entry
in the table. With as many as 12000 or more entries possible in the table,
this may appear to be a formidable task. Fortunately, it is possible to
generate this table with a clever program and fine tune the entries manually.
Also, since glyphs for a script cannot really vary a lot across fonts,
generating the table for a second font will be easy if the table for the
first font is already available.
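A reader for such a table could look roughly like the Python sketch below. The concrete syntax is an assumption: the three codes are taken to be decimal integers and the glyph indexes to be decimal numbers separated by spaces, which the description above does not specify. The function name parse_tab and the sample entries are made up for the illustration.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed concrete syntax: "[c,cnj,v] = g1 g2 ... gn" with decimal numbers.
ENTRY = re.compile(r"^\s*\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\s*=\s*(.+?)\s*$")


def parse_tab(lines: Iterable[str]) -> Dict[Tuple[int, int, int], List[int]]:
    """Parse table entries into a mapping from triplet to glyph indexes."""
    table: Dict[Tuple[int, int, int], List[int]] = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue                      # skip blank lines
        match = ENTRY.match(line)
        if match is None:
            raise ValueError(f"line {number}: not a table entry: {line!r}")
        c, cnj, v = (int(match.group(i)) for i in (1, 2, 3))
        glyphs = [int(token) for token in match.group(4).split()]
        if not 1 <= cnj <= 31:
            raise ValueError(f"line {number}: conjunct value out of range")
        if any(not 0 <= g <= 0xFF for g in glyphs):
            raise ValueError(f"line {number}: glyph index is not eight bit")
        table[(c, cnj, v)] = glyphs
    return table


sample = ["[12,1,3] = 107 205", "", "[12,2,3] = 107 198 205"]
parsed = parse_tab(sample)
```

The range checks mirror the limits stated above: a conjunct value of 1-31 and eight bit glyph indexes.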
It will also
be seen that the set of glyphs g1g2..gn for any akshara will vary from
font to font. The manner in which one builds up an akshara by choosing
appropriate glyphs may also vary depending on the specific form required
to be shown since an akshara may be shown in more than one form.
Another important
point to note here is that the syllable is not associated with a script
but with a language.
Hence it will be easy to show the
same syllable in multiple scripts merely by switching the fonts and the
associated tables. It is therefore possible to achieve perfect transliteration
across scripts.
The glyph string is
the basis for generating the display. One need not stop with just displaying
the glyphs using the API provided by the Operating System to display text
strings. One can indeed pass the glyph string to other applications that
can provide different visual representations such as an image of the text,
an XML description of the display, HTML output or just a PostScript representation.
Thus the fixed length syllable level code can be used effectively to produce
a variety of display formats and one can dynamically choose the format
by merely passing the glyph string to an appropriate module. This will
have significant benefits when generating displays for web browsers which
are known to support one format well, say PDF. The local language text
may then be sent as a PDF file to the browser on demand, which guarantees
viewing of the text.
The on-line
demo at the Acharya web site demonstrates this well.
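The idea of handing the same glyph string to different output modules can be sketched in Python as below. The module names, the font name ExampleFont and the fixed page position are all hypothetical; only two of the formats mentioned above are shown, and the HTML variant assumes glyph indexes in the printable ISO-8859-1 ranges, where the index equals the character code.

```python
import html
from typing import Callable, Dict, List


def to_html(glyphs: List[int], font: str) -> str:
    """Emit each glyph index as a numeric character reference."""
    text = "".join(f"&#{g};" for g in glyphs)
    family = html.escape(font, quote=True)
    return f'<span style="font-family: {family}">{text}</span>'


def to_postscript(glyphs: List[int], font: str) -> str:
    """Emit a PostScript fragment showing the glyphs as an octal-escaped string."""
    body = "".join(f"\\{g:03o}" for g in glyphs)
    return f"/{font} findfont 12 scalefont setfont\n72 720 moveto\n({body}) show\n"


RENDERERS: Dict[str, Callable[[List[int], str], str]] = {
    "html": to_html,
    "postscript": to_postscript,
}


def render(glyphs: List[int], font: str, fmt: str) -> str:
    """Pass the glyph string to the module registered for the chosen format."""
    try:
        renderer = RENDERERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    return renderer(glyphs, font)


html_out = render([0x6B, 0xCD], "ExampleFont", "html")
ps_out = render([0x6B, 0xCD], "ExampleFont", "postscript")
```

Adding a further format in this sketch means registering one more function in RENDERERS; the syllable codes and the table lookup are unaffected.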
The essence of the
IITM scheme of representing text lies in the power of the fixed length
code for a well chosen set of syllables. This approach truly separates
the rendering process from the encoding process and so applications can
very easily render text through common APIs that are proven and known to
work across systems. While one may argue that the scheme does not cater
to all possible syllables, the approach is indeed scalable if the code
space available can be extended to 32 bits. Variable length representations
such as Unicode always require applications to know how an akshara should
be rendered and to check whether this rendering is possible with the font provided
(an Open Type font in this case). Thus an application supporting Unicode
for Indian languages necessarily has to be language or script dependent.
There is no simple method which allows the same Unicode text (same set
of syllables) to be displayed in multiple scripts.
Working with Unicode Fonts
The mapping
function can also specify sixteen bit glyph index values on the right hand
side of the entries in the table. This will allow sixteen bit fonts to
be used. Since arbitrary Unicode values cannot be assigned to the glyphs,
the range E000-E7FF, which lies in the area reserved for user defined codes,
may be utilized for this purpose and the glyphs located in this region. In fact Open Type Fonts
for Indian languages use this range to define the composite glyphs used
in forming the aksharas.
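As a rough Python sketch, glyph positions can be turned into characters in the range E000-E7FF as follows. The function names and the idea of addressing a glyph by its offset from E000 are assumptions made for the example; how a particular font lays out its glyphs within the range is not described here.

```python
from typing import Iterable

PUA_START, PUA_END = 0xE000, 0xE7FF   # range named in the text above


def glyph_to_char(offset: int) -> str:
    """Map a glyph offset (0..0x7FF) to the character at E000 + offset."""
    code_point = PUA_START + offset
    if not PUA_START <= code_point <= PUA_END:
        raise ValueError(f"glyph offset {offset} falls outside E000-E7FF")
    return chr(code_point)


def glyphs_to_text(offsets: Iterable[int]) -> str:
    """Build the 16-bit glyph string for a sequence of glyph offsets."""
    return "".join(glyph_to_char(offset) for offset in offsets)


text = glyphs_to_text([0x000, 0x041, 0x7FF])
encoded = text.encode("utf-8")        # each of these characters takes 3 bytes
```

The range holds 2048 positions, so any offset above 0x7FF is rejected rather than silently wrapped.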
It turns out
that there is no need to use Open Type Fonts when 16 bit character codes
are used. The Open Type Font is specifically contrived to cater to mapping
variable length codes to specific sequences of glyphs. One has seen that
it is really not possible with the existing Open Type Fonts (Mangal under
Windows or the BBC fonts etc.) to cater to the full complement of ligatures
used in traditional print. Yet, one could do quite well with 8 bit fonts
supporting 240 or so glyphs. The only difficulty with these eight bit fonts
is that they are not guaranteed to render properly on any platform and
an eight bit font that provides compatibility cannot really have more than
about 188 glyphs. With 16 bit Fonts, we can use many more than 256
glyphs and achieve high quality typesetting, so long as the underlying
system has the support to render each glyph properly. It is observed that
rendering of the glyphs in the user defined area is done properly only
under MS Windows. The Mac and Linux systems seem to have problems (as of
this writing) in handling glyphs in the range E000-E7FF.
There are many advantages
to using traditional True Type fonts if they can support a sufficient number
of glyphs to fulfill the requirements of rendering text. The IITM Unicode
Font has the required number of glyphs to handle all the scripts. Each
script has 256 glyph locations, many more than the 188 possible with
eight bit fonts.
The development
team at SDL has come up with a utility to convert .llf files to HTML using
the IITM Multilingual Unicode font. The HTML file is produced in UTF-8
format. You can compare the display of the same text using standard
Unicode rendering against the display produced with IITM Unicode fonts.
Here is the link
to the concerned page which provides the details.
Converting Glyph strings to llf
Applications
supporting cut/copy and paste operations may also be able to map the displayed
glyphs back to their corresponding llf characters. Basically, the glyphs
of the font are to be interpreted as symbols of text which may be parsed
and the text converted through Lex and Yacc based tools.
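Scanners generated by Lex prefer the longest match, and the same idea can be sketched directly in Python. The sketch below is not the IITM implementation: it assumes a table that maps each syllable triplet to a distinct glyph sequence, uses made-up entries, and relies on greedy matching, which is not guaranteed to segment every font's glyph strings correctly.

```python
from typing import Dict, List, Sequence, Tuple

Triplet = Tuple[int, int, int]


def build_reverse(table: Dict[Triplet, List[int]]) -> Dict[Tuple[int, ...], Triplet]:
    """Invert the table; refuse tables where two syllables share one glyph string."""
    reverse: Dict[Tuple[int, ...], Triplet] = {}
    for syllable, glyphs in table.items():
        key = tuple(glyphs)
        if key in reverse:
            raise ValueError(f"glyph string {key} is ambiguous")
        reverse[key] = syllable
    return reverse


def glyphs_to_syllables(glyphs: Sequence[int],
                        reverse: Dict[Tuple[int, ...], Triplet]) -> List[Triplet]:
    """Greedy longest-match scan of a glyph string back into syllables."""
    longest = max((len(key) for key in reverse), default=0)
    result: List[Triplet] = []
    pos = 0
    while pos < len(glyphs):
        for size in range(min(longest, len(glyphs) - pos), 0, -1):
            key = tuple(glyphs[pos:pos + size])
            if key in reverse:
                result.append(reverse[key])
                pos += size
                break
        else:
            raise ValueError(f"no syllable matches the glyphs at position {pos}")
    return result


demo_table = {(1, 1, 1): [10], (1, 2, 1): [10, 11], (2, 1, 1): [20]}
recovered = glyphs_to_syllables([10, 11, 20, 10], build_reverse(demo_table))
```

Here the glyph pair 10, 11 is matched as one syllable before the single glyph 10 is considered, which is the longest-match behaviour a Lex scanner would also show.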
In syllabic
writing systems, it is not always required that there be a one to one mapping
between the internally stored text string and the corresponding glyph string
used to display the same. However, a one to one mapping can simplify the
reverse process of identifying a syllable from a glyph. In this case, the
number of glyphs in the font can well be of the order of 10000 or more
if alternate rendering of text should also be accommodated. In practice
however, one takes advantage of the fact that a syllable can be shaped
according to rules using a much smaller set, perhaps of the order of hundreds.
With this approach, getting back a syllable code from the glyphs will be
a challenge.
Encoding standards
such as Unicode suffer from this problem where a multibyte string (variable
length) is used to represent a syllable. It is common practice in these
cases to employ two internal buffers while dealing with displayed text,
one for the text consistent with the encoding method and the other for
the glyph string. Cut/Copy and paste operations always use these two buffers
to maintain the correct relationship between the text string and the glyph
string.
In the IITM approach,
the application is required to maintain two buffers, one for the syllable
codes and the other for any representation of the displayed text that can
directly relate to the font, such as a glyph string, an XML based
description of the string etc.
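One way to keep the two buffers in step is sketched below in Python. The class DisplayBuffer, its fields and the sample table are hypothetical, and the sketch assumes that every syllable produces at least one glyph; it merely shows how a selection made on the glyph side can be traced back to syllable codes.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Triplet = Tuple[int, int, int]


class DisplayBuffer:
    """Keeps syllable codes and the glyph string side by side."""

    def __init__(self, table: Dict[Triplet, List[int]]) -> None:
        self.table = table
        self.syllables: List[Triplet] = []   # buffer 1: syllable codes
        self.glyphs: List[int] = []          # buffer 2: glyph string
        self.starts: List[int] = []          # glyph offset where each syllable begins

    def append(self, syllable: Triplet) -> None:
        """Add a syllable and its glyphs, recording where the glyphs start."""
        self.starts.append(len(self.glyphs))
        self.syllables.append(syllable)
        self.glyphs.extend(self.table[syllable])

    def syllables_for_glyph_range(self, begin: int, end: int) -> List[Triplet]:
        """Return the syllables whose glyphs overlap glyphs[begin:end]."""
        if not 0 <= begin <= end <= len(self.glyphs):
            raise ValueError("glyph range out of bounds")
        if begin == end:
            return []
        first = bisect_right(self.starts, begin) - 1
        last = bisect_right(self.starts, end - 1) - 1
        return self.syllables[first:last + 1]


buffer = DisplayBuffer({(1, 1, 1): [10, 11], (2, 1, 1): [20], (3, 1, 2): [30, 31, 32]})
for syllable in [(1, 1, 1), (2, 1, 1), (3, 1, 2)]:
    buffer.append(syllable)
selected = buffer.syllables_for_glyph_range(1, 3)
```

A copy operation in this sketch would place the returned syllable codes on the clipboard, so that a selection that cuts through the middle of an akshara still yields whole syllables.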
In the MS Windows version
of the IITM multilingual editor, the internal buffer for the glyph string
is managed through Rich Text controls and hence copy/paste operations are
straightforward when text is pasted onto other applications. However, once
pasted, the text will not conform to the llf standard and so editing will
not be possible at the syllable level but only at the glyph level.