Displaying Indian Scripts on Web pages
Introduction
Web pages are normally set up to conform to the HTML standard. An HTML
document for a web page may contain text, graphical images, multimedia
content and so on. HTML is essentially a means of describing how a
document's content should be displayed. This information is provided through
"tags", where a pair of tags specifies how the content enclosed between them
should be handled. In effect, an HTML document is a description of what the
web page looks like, along with its contents.
Web browsers are generally capable of interpreting HTML documents and
displaying them within a window. To make sure that an HTML document is
properly interpreted by web browsers on many different computers, certain
conventions are followed. These conventions address differences in the
resulting displays that could arise from the way the browser application
functions or from the support provided by the underlying operating system.
The HTML document is constrained to use Roman letters for specifying the
formatting information, though the content itself may be non-Roman text. The
job of the browser is to understand the formatting specified and render the
contents using appropriate resources such as fonts and media players.
Text Encoding
To properly relate the content to its displayed form, a browser selects a
font that correctly shows the characters whose codes constitute the contents
of a page. An HTML document has a provision to specify how its text should be
interpreted, by indicating the encoding scheme used in creating the text. For
standard English, this would be pure ASCII text conforming to the Latin
character set. When text in other languages has to be displayed, a different
encoding scheme may be specified for the content. It is assumed here that the
text is stored as 8-bit values.
Multilingual displays run into difficulties because the browser cannot know
how to interpret an 8-bit character unless the corresponding encoding
information is known. Typically, this problem has been handled by specifying
a font that provides the required encoding. The HTML standard allows font
specification tags in a document so that different parts of the content can
be displayed with different fonts. The creator of the web page should ensure
that the required font will be available to the browser and that it supports
the encoding of the text.
Often, if the required font is not present in the system, the browser
substitutes an equivalent font and runs the risk of displaying the text
incorrectly. Experts recommend that if the font tag is used, the text should
conform to the actual encoding supported by the font, which unfortunately can
vary from system to system. Thus the font specification tag is useful only
when it is clearly known that the system running the browser will correctly
interpret the codes constituting the text. Often this assumption fails.
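The dependence on the native encoding of a system can be illustrated with a
short Python sketch. It is not part of the original discussion: the byte value
0xE0 is chosen purely for illustration, and the codec names are those of
Python's standard library. The sketch interprets one 8-bit code under the
Latin-1 encoding and under the Mac Roman encoding of the Macintosh.

```python
# One 8-bit code, two native encodings: the displayed character differs.
code = bytes([0xE0])  # illustrative byte value, not taken from any real page

as_latin1 = code.decode("latin-1")       # ISO 8859-1 interpretation
as_mac_roman = code.decode("mac_roman")  # classic Macintosh interpretation

print(repr(as_latin1))
print(repr(as_mac_roman))
print(as_latin1 == as_mac_roman)
```

The two interpretations disagree, which is why a page that relies on a font's
8-bit encoding may display correctly on one system and not on another.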
Single Encoding for multilingual content
Multilingual content may be handled in a page if a common encoding scheme is
chosen for the different scripts. Unicode is one such scheme, in which close
to 60,000 characters across 70 or more languages of the world have code
assignments that identify each character uniquely as coming from a specific
script. Browsers are yet to comply with the full requirements of rendering
Unicode text. Hence the use of the font specification tag continues.
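As a minimal sketch of how a Unicode code identifies the script of a
character, the Python fragment below looks up the standard character names
using the unicodedata module. The sample characters and the helper name
script_of are chosen here for illustration only.

```python
import unicodedata

# One letter each from Devanagari, Tamil, Telugu and Malayalam, plus a Roman
# letter, written as escapes so that the listing itself stays plain ASCII.
samples = ["\u0915", "\u0ba4", "\u0c24", "\u0d24", "A"]

def script_of(ch):
    """Return the first word of the character's Unicode name (hypothetical helper)."""
    return unicodedata.name(ch).split()[0]

for ch in samples:
    print(f"U+{ord(ch):04X}", script_of(ch))
```

Unlike an 8-bit glyph code, the code value alone is enough to tell which
script the character belongs to, whatever font is eventually used.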
The most important of the conventions mentioned in the Introduction is the
requirement that only pure ASCII text be used to specify the formatting of
the information. As of today, the only item of information that is correctly
identified on all computer systems is plain ASCII text, consisting of 95
displayable characters. HTML documents may therefore be prepared easily with
even simple text editors on all computers.
There is a well-written primer on preparing HTML documents, which is itself
written in HTML. The discussion below assumes that you understand the basic
principles of HTML, especially the meaning and interpretation of the tags.
Let us look at a very simple HTML document. The document begins with an
<HTML> tag and ends with a </HTML> tag. The HTML text is shown below.
<HTML>
<H1> This is a Heading </H1>
<p> A simple HTML document is quite easy to generate using an ordinary text
editor on virtually any computer. </p>
<p> Greetings and Welcome </p>
<a HREF="learn.html">Click here to learn more about HTML</a>
</HTML>
Shown below is a representation
of what the document will look like when seen through a web browser.
Observe that the text strings enclosed within angle brackets "< >" act as
directions for presenting and formatting the displayed information. The
underlined sentence is a hypertext link to another document, in this case a
primer on HTML, which is kept on the same computer in a file named learn.html.
From this it appears that HTML documents can display Roman text very easily.
What about non-Roman text, or text in other languages? HTML provides an
answer through tags that specify different fonts to be used while displaying
the text. The assumption here is that the codes in the text identify the
shapes to be displayed.
The earlier document may be modified as shown below to show portions of the
document in a different font. Here we use Helvetica (a sans-serif font) in
place of the default font.
<HTML>
<font face="Helvetica">
<H1> This is a Heading </H1>
<p> A simple HTML document is quite easy to generate using an ordinary text
editor on virtually any computer. </p>
<p> Greetings and Welcome </p>
<a HREF="learn.html">Click here to learn more about HTML</a>
</font>
</HTML>
This document will display
like this when viewed from a web browser.

Multilingual Displays
The HTML standard allows text to be displayed using different fonts. By
specifying different fonts for different sections of a page, one can get
pleasing displays. The font is specified through the
"<font face= ...> ... </font>" pair of tags where, following the
"font face=" part of the tag, one gives the actual name of the font to be
used for the display. Given below is a sample HTML document which uses this
tag to display text strings in two different fonts. The image which follows
shows the effect when the HTML document is viewed with a web browser. It must
be emphasized that this approach is not considered sound, for there is no
guarantee that the browser will be able to correctly interpret the codes in
the text to select the shapes for display. This results from the differing
native encodings supported on different systems, specifically the Macintosh.
Yet the method works so long as the creator of the web page knows that the
browser will handle the content properly.
The discussion above establishes that multilingual text in Indian languages
may be displayed on a web page by using fonts appropriate to the scripts to
be shown. The aksharas of the language displayed through a font are built
from the glyphs in that font. When a section of the HTML page is to be
displayed in a specific script, the section may be bracketed by the
<font face=....> .......</font> tags, and the text in the section would
consist of the letters or special symbols corresponding to the glyphs in the
font. The text is always stored as 8-bit codes and may include characters in
the region 160-255, which lies beyond the 7-bit ASCII range. Given below is a
screen image of an HTML page containing text in Roman, Sanskrit, Tamil, Telugu
and Malayalam.
<HTML>
Multilingual presentations made easy. <br>
Sanskrit <font face="Sanskrit 1.2">s<Sk«tm! </font>
Tamil <font face="iitmtam">êë¨Èª </font><br>
Telugu <font face="Pothana"> ?"lÇgÇ </font>
Malayalam <font face="LTML-Manoj">cnReNcx <br></font>
</HTML>
One cannot look at the contents of the HTML page and figure out what aksharas
will appear in the display. The <font face="..."> tags will give some idea of
the font, but unless the font name reveals the script it will be difficult to
guess the same. With most Indian scripts, there will be glyphs in the 128-255
range of codes, and in the HTML document these will show up as special signs,
Roman letters with diacritics and the like, making it difficult to get an idea
of the display. Yet this is the best known method to display Indian scripts on
web pages.
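The point can be checked with a small Python sketch. It takes the Tamil
sample text from the listing above and prints the 8-bit codes behind it,
assuming that the characters seen in the listing are the Latin-1 view of the
glyph codes (an assumption made here, not something stated by the page).

```python
# The Tamil sample from the listing, written with escapes so that this code
# stays plain ASCII. Assumption: the listing shows glyph codes as Latin-1.
sample = "\u00ea\u00eb\u00a8\u00c8\u00aa"

codes = list(sample.encode("latin-1"))  # one 8-bit glyph code per character
print(codes)
print(all(code >= 128 for code in codes))
```

Every code lies in the 128-255 range, and nothing in the values themselves
says which Tamil aksharas the font will draw for them.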
It might appear from the discussion above that it is not difficult to prepare
HTML documents for displaying Indian language text. This is indeed the case,
except that no simple technique exists for entering the characters
corresponding to the glyph codes. The selection of the glyph codes depends on
the font used for the script, and one may not easily type in characters whose
codes are 128 or above. Even if this were possible through some means, it
would be difficult for the user, who thinks in terms of aksharas and not glyph
codes. While some desktop publishing packages and word processors may allow
such codes to be input directly, the process is unduly cumbersome even for a
single script, let alone multilingual text.
In Roman fonts, a letter is specified through its ASCII code and the location
of its glyph in the font is given directly by the code. Hence when the font is
changed, the displayed letter remains the same and only its appearance
differs. In most western languages, a letter of the alphabet maps directly
into a single glyph, specified through a code within the displayable ASCII
range (32-126). This makes it easy for HTML pages to be typed in on a regular
QWERTY keyboard using a simple editor. However, even for some European
languages which employ accented characters, such as Greek, one faces the same
difficulty, as the keyboard does not offer a simple way of entering the
accented characters. For Indian languages and scripts, a data entry process
that relies on glyph codes is unwieldy, as it is too complex and font
dependent even for a single language.
One solution to this problem is to effect the data entry not in terms of glyph
codes but through a transliteration mechanism where each akshara is input
through a standardized sequence of keystrokes. The transliterated text would
thus consist of displayable ASCII but would be equivalent to the text in the
Indian language. Also, one will be able to make out what aksharas will be
seen. The only problem with this scheme is that it is not suitable for
directly generating the web page, for what the web page requires is the glyph
codes. If a computer program can map the transliterated text into the
appropriate glyph codes, the method will work. In fact, this is the method
suggested for HTML page creation using the ITRANS and JTRANS packages. These
popular packages on the web take a text file containing the transliterated
text and produce suitable HTML output. One may not get much control over the
formatting effected by the packages, but that can be managed by directly
editing the HTML file.
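The mapping step can be sketched in a few lines of Python. This is only an
illustration of the idea and not the method used by ITRANS or JTRANS: the
table GLYPH_TABLE, its key sequences and its glyph codes are all hypothetical,
invented for an imaginary font.

```python
# Hypothetical table: transliteration key sequence -> 8-bit glyph codes of an
# imaginary font. Real schemes and fonts use different, much larger tables.
GLYPH_TABLE = {
    "ka": [0x6B],
    "kha": [0x4B],
    "ki": [0x69, 0x6B],  # vowel sign drawn before the consonant: two glyphs
    "ta": [0x74],
}

def to_glyph_codes(text, table=GLYPH_TABLE):
    """Convert transliterated text to glyph codes by greedy longest match."""
    longest = max(len(key) for key in table)
    codes, pos = [], 0
    while pos < len(text):
        # Try the longest candidate key sequence first, then shorter ones.
        for size in range(min(longest, len(text) - pos), 0, -1):
            chunk = text[pos:pos + size]
            if chunk in table:
                codes.extend(table[chunk])
                pos += size
                break
        else:
            raise ValueError(f"no mapping for input at position {pos}")
    return bytes(codes)

print(to_glyph_codes("khakita"))
```

The input stays readable ("kha", "ki", "ta"), while the output is the font
dependent byte string that would be placed between the font tags. Supporting
another font means swapping the table, not retyping the text.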
A number of Indian language magazines and newspapers on the web use specific
fonts for displaying their pages. The web pages are prepared using language
specific and font specific data entry software and may contain Roman text as
well. These magazines also allow users to download the required fonts before
reading the text.
Using word processors to generate html files for Indian Scripts
Word processors that support text preparation using user specified fonts may
also be used for preparing HTML pages. These packages come with features to
convert a document into HTML format automatically. Some even have HTML editing
features built into them, making the job really easy. The problem of data
entry remains, however, since the word processor allows direct data entry only
for glyph codes in the displayable ASCII range. If the text in the Indian
language is available in a format compatible with the word processor (e.g.,
rich text format), the word processor's cut and paste facilities may be used
effectively to edit the HTML document.
The HTML standard allows one other nice way of specifying the glyphs to be
displayed, by giving their specific names or their code values in a prescribed
manner. The concept here is known as "entities", where a name is associated
with a code value. Instead of the code value, the text has the name of the
entity substituted. In this way virtually any glyph can be displayed just by
using an entity name such as "&agrave;" to specify that the named glyph should
be displayed (which happens to be the letter "a" with a grave accent). The HTML
standard refers to these as character entities, and most of the glyphs in
standard encoded fonts have an entity name. While this method will most
certainly work, it is nevertheless painful in practice. The Macintosh refuses
to render many entities, however! So the problem of multilingual displays
continues to plague us despite the tricks we may employ to get the rendering
right on a specific platform.
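A short Python sketch, using only the standard html module, shows the
association between entity names and code values and how text can be rewritten
so that only ASCII remains. The helper to_references is a hypothetical name
introduced for this illustration.

```python
import html
from html.entities import codepoint2name, name2codepoint

# The code value behind the entity name, and the equivalent numeric form.
print(name2codepoint["agrave"])
print(html.unescape("&agrave;") == html.unescape("&#224;"))

def to_references(text):
    """Replace each non-ASCII character by a named or numeric reference."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                       # plain ASCII stays as typed
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # named entity exists
        else:
            out.append(f"&#{cp};")               # fall back to the code value
    return "".join(out)

print(to_references("voil\u00e0 \u0915"))
```

The sketch also shows why the method is painful in practice: every character
outside ASCII turns into a sequence of several keystrokes that says little
about the shape it stands for.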
Using the IIT Madras multilingual editor to create html documents
The multilingual editor from IIT Madras is a good choice for preparing HTML
files for display on the web. Using the editor, one can directly type in the
required HTML text just as one would with a normal editor. The ability of the
editor to handle data entry in Roman along with the local scripts comes in
handy for this purpose. Shown below is the screen image of an actual HTML file
being typed in as a .llf file using the editor. Following the screen image is
the HTML document obtained by converting the .llf file to HTML using the
llf2html conversion utility supplied with the editor. The final image is the
display as seen in a web browser. Please observe that the Roman text in the
document and the HTML tags are retained as such, but the glyphs for the
Sanskrit and Tamil letters are substituted at the appropriate places.

<HTML>
<P>This is an example of an html file prepared using the IITM multilingual
editor. The html tags may be typed in easily by switching the input to
Roman. Normal local scripts may be typed in using the appropriate scripts.
<PRE>
<FONT FACE="Times">This section will appear pre formatted.
</FONT><FONT FACE="Sanskrit 1.2">suSvagtm! </FONT><FONT FACE="iitmtam">åùªôõ²
</FONT><A HREF="meaning.html"><FONT FACE="Times"> meaning of </FONT><FONT FACE="Sanskrit 1.2">suSvagtm!</FONT></A></PRE>
</HTML>

The llf2html program gives a straight HTML file which may be edited further
with standard HTML editors to improve the formatting, if required. At this
point, it may be pertinent to ask: can we not use a standard HTML editor to
begin with and select the fonts for Indian scripts? The answer is obviously
yes, but one should remember that data entry then has to proceed in terms of
glyph information and so can never be natural, as the user thinks in terms of
aksharas. The proper method would be to use the IITM editor to quickly type in
the local language text and then format it using the features of the HTML
editor. Formatting usually does not involve further editing of the text, so
keyboard input will be unnecessary. If required, the cut and paste features
may be used to effect minor editing of the text.
It is always meaningful to make the data entry process natural and simple.
The formation of complex conjuncts in the Indian scripts is still a formidable
task even with the most sophisticated word processors. Obviously we cannot
expect anything from programs built for "one letter, one glyph" languages!
This is where data entry using the IIT Madras software offers definite
advantages, as it supports a well defined "one key sequence, one akshara"
mapping regardless of the number of glyphs used in building up the akshara.
Multilingual displays with Unicode text in html documents
It would appear that Unicode is the proper solution for web pages, for there
would be little confusion as to the encoding and hence the choice of a font.
It turns out that this is indeed so. Unfortunately, the rendering of Unicode
text in Indian languages is beset with problems, since the application is
expected to know how a syllable should be rendered. The same Unicode text is
known to appear differently in different browsers. Also, there is no general
agreement on how one should go about designing applications that render
syllables rather than letters of the alphabet.
Despite this, Unicode based web pages may show Indian language content in a
satisfactory manner, though not in the preferred way. If text on a web page
has to be processed further (even copied and pasted into another multilingual
application), there will be problems when different applications render the
same Unicode text differently (yes!). "Vagaries of Unicode rendering" is an
amusing page to view.
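Why the application has to know about syllables can be seen from a small
Python sketch. The Devanagari conjunct "ksha", chosen here purely as an
example, is read and written as a single akshara, yet it is stored as three
separate Unicode code points which the rendering software must combine.

```python
import unicodedata

# The conjunct "ksha": one akshara on screen, three code points in the text.
ksha = "\u0915\u094d\u0937"

for ch in ksha:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

print(len(ksha))
```

The text carries only the consonants and the joining sign. Whether they
appear as a proper conjunct or as separate letters with a visible joining
sign depends on the application and the font, which is one source of the
differences between browsers noted above.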
Creating e-content in Indian languages
One of the difficulties faced in disseminating content in Indian languages is
the lack of appropriate standards that allow the development of content which
can be rendered correctly and uniformly across systems. Well established and
proven approaches to dealing with ASCII text will not work well, since the
text will contain only shape information and not true linguistic content.
Methods based on Unicode suffer from the requirement that rendering is also
partly the responsibility of the application (when OpenType fonts are used).
We have already seen that the rendering of Unicode text in Indian languages
varies considerably between applications. The best alternative for dealing
with these painful situations is to create the content in a format where
rendering is built into the content. This way, special resources and fonts
will not be called for.
One can use the Portable Document Format (PDF) to create files and embed the
fonts used. This is a viable approach and is known to work well when the
primary goal is to display content and not to process it further. Also, the
content could be generated on the fly as images (GIF, PNG or JPG formats) and
sent to the browser. Images are almost always rendered properly by browsers.
The IITM utility to create PDF documents from llf files may be used
effectively for this purpose. The online lessons for learning Sanskrit served
from the Acharya site are a good example of the approach.
Browser independent display handling methods
Browser independence is a complex matter. The HTML standard does not specify
the actual resources which must be used in generating the display in the
browser but only indicates what is to be done. Browsers often substitute
compatible, locally available resources when a specified resource (such as a
font) is not present in the system. For Indian languages, there are no
standards (yet) for text rendering and, worse, one cannot think of standards
across scripts. Hence the best approach is to serve the content in a format
that most browsers will render correctly. The approach using Dynamic Fonts
may appear convincing, but it is clearly established that the encoding used
in the font is not recognized properly across systems, which restricts the
Dynamic Font approach to the MS Windows platform. Dynamically generated
images offer the best solution.
Using Java
Java applets can be used with advantage on web pages. Unfortunately, the
manner in which Java code is interpreted and run varies considerably, owing
to rapid changes taking place in the Java runtime support. Java may not allow
flexible handling of different fonts. However, there are effective ways in
which a specified font can be downloaded and used by an applet. This is a
viable alternative, but one must accept that the Java code itself has to be
written in such a way that it will execute correctly inside the browser. In
other words, a Java applet is not guaranteed to run on each and every
browser, even if the Java Runtime Environment is present in the system.