Displaying Indian Scripts on Web pages
Introduction
   Normal web pages are set up to conform to the "html" standard. An html document for a web page may contain text, graphical images, multimedia content and so on. The html concept is essentially a means of describing how a document's content should be displayed. Such information is provided through "tags", where a pair of tags specifies how the content enclosed between them should be handled. In effect, an html document is a description of what the web page looks like, along with its contents.

   Web browsers are generally capable of interpreting html documents and displaying them within a window. To make sure that an html document is properly interpreted by web browsers on many different computers, certain conventions are followed. These conventions relate to issues that could cause differences in the resulting displays due to the manner in which the browser application functions or the support provided by the underlying operating system. The html document is constrained to using Roman letters for specifying the formatting information though the content itself may relate to non-Roman text. The job of the browser is to understand the formatting specified and render the contents using appropriate resources such as fonts, media players etc.

Text Encoding

  To properly relate the content to its displayed form, a browser selects a font that will correctly show the characters whose codes constitute the contents of a page. The html document has a provision to specify how the text in the document should be interpreted, by indicating the encoding scheme used in creating the text. For standard English, this would be pure ASCII text conforming to the Latin character set. When text in other languages has to be displayed, a different encoding scheme may be specified for the content. It is assumed here that the text is stored in terms of 8-bit values.

   Multilingual displays will run into difficulties because the browser may not know how to interpret an 8-bit character unless the corresponding encoding information is known. Typically, this problem has been handled by specifying the use of a font that provides the required encoding. The html standard allows font specification tags to be present in a document so as to permit different parts of the content to be displayed with different fonts.

  The creator of the web page should ensure that the required font will be available to the browser and that it supports the encoding of the text. If the required font is not present in the system, the browser may substitute an equivalent font and run the risk of displaying the text incorrectly. Experts recommend that if the font tag is used, the text should conform to the actual encoding supported by the font, which unfortunately can vary from system to system. Thus the font specification tag will be useful only when it is clearly known that the system running the browser will correctly interpret the codes constituting the text. Often this assumption fails.
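
  The risk described above can be illustrated with a short sketch. The Python snippet below is only an illustration and is not part of any tool mentioned on this page; the function name `interpret` and the choice of the three encodings are assumptions made for the example. It decodes the same 8-bit code under three common "native" encodings to show that the character selected depends on the system doing the interpretation.

```python
# Illustrative sketch: one 8-bit code, several interpretations.
# The three encodings are examples chosen for this sketch.

def interpret(raw: bytes) -> dict:
    """Decode the same bytes under several 8-bit encodings."""
    encodings = ["latin-1", "cp1252", "mac_roman"]
    return {name: raw.decode(name) for name in encodings}


if __name__ == "__main__":
    # Code 0xE0 (decimal 224) is "a with grave accent" in Latin-1 and
    # Windows-1252, but a double dagger in the Mac Roman encoding.
    for name, text in interpret(b"\xe0").items():
        print(f"{name:10s} -> {text!r}")
```

  A font that places an Indian script glyph at code 224 is therefore at the mercy of whichever interpretation the system applies before the font is consulted.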


Single Encoding for multilingual content

  Multilingual content may be handled in a page if a common encoding scheme is chosen for different scripts. Unicode is one such scheme where close to 60,000 characters across 70 or more languages of the world have code assignments that will identify a character uniquely as coming from a specific script. Browsers are yet to comply with the full requirements of rendering Unicode Text. Hence the use of the font specification tag continues.
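
  The idea that a Unicode code identifies the script of a character can be sketched as follows. This is a minimal illustration in Python using the standard `unicodedata` module; the helper `script_of` is a hypothetical name, and taking the first word of the character name is a simplification that happens to work for the letters shown, not a general script lookup.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Return the first word of the character's Unicode name.

    For the sample letters below this word is the script name.
    This is a simplification, not a full script-property lookup.
    """
    return unicodedata.name(ch).split()[0]


# One sample letter per script, written as escapes so that the
# example does not depend on any font being installed.
SAMPLES = {
    "Roman": "A",
    "Sanskrit (Devanagari)": "\u0915",
    "Tamil": "\u0ba4",
    "Telugu": "\u0c15",
    "Malayalam": "\u0d2e",
}

if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        print(f"{label:22s} U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

  No font name is needed to tell the scripts apart; the code value alone carries that information.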

  The most important of the conventions mentioned earlier is the requirement that only pure ASCII text be used to specify the formatting of the information. As of today, the only item of information which is correctly identified on all computer systems is plain ASCII text, consisting of 95 displayable characters. So, html documents may be prepared easily with even simple text editors on all computers. There is a well written primer on preparing html documents which itself has been written in the html format. The discussion below assumes that you understand the basic principles of html, especially the meaning and the interpretation of the tags.

  Let us look at a very simple HTML document. The document begins with an <HTML> tag and ends with a </HTML> tag. The html text is shown below.



<HTML> 
<H1> This is a Heading </H1> 
<p> A simple HTML document is quite easy to generate using an ordinary text editor on virtually any computer. </p> 
<p> Greetings and Welcome </p> 
<a HREF="learn.html">Click here to learn more about HTML</a> 
</HTML> 


Shown below is a representation of what the document will look like when seen through a web browser. 
image
  It is observed that text strings enclosed within angle brackets "< >" act as guidelines for presenting and formatting the displayed information. The underlined sentence is a hypertext link to another document, in this case a primer on HTML, which is stored on the same computer in a file named learn.html.

  From this it appears that HTML documents can display Roman text very easily. What about non-Roman text or text in other languages? The HTML mechanism provides an answer to this through the use of tags that specify different fonts to be used while displaying the text. The assumption here is that the codes in the text reflect the shapes to be displayed.

  The earlier document may be modified as shown below to show portions of the document in a different font. Here we will use Helvetica (a sans-serif font) in place of the default font.



<HTML> 
<font face="Helvetica"> 
<H1> This is a Heading </H1> 
<p> A simple HTML document is quite easy to generate using an ordinary text editor on virtually any computer. </p> 
<p> Greetings and Welcome </p> 
<a HREF="learn.html">Click here to learn more about HTML</a> 
</font> 
</HTML> 

This document will display like this when viewed from a web browser. 


image



Multilingual Displays
   The html standard allows text to be displayed using different fonts. By specifying different fonts for different sections of a page, one will be able to get pleasing displays. Specifying the font is done through the " <font face= ...> ... </font>" pair of tags where, following the "font face=" part of the tag, one gives the actual name of the font to be used for the display. Given below is a sample html document which uses this tag to display text strings in different fonts. The image which follows shows the effect when the html document is viewed with a web browser.

   It must be emphasized here that this approach is not considered sound, for there is no guarantee that the browser will be able to correctly interpret the codes in the text to select the shapes for display. This results from the differing native encodings supported on different systems, specifically the Macintosh. Yet, this method works so long as the creator of the web page knows that the browser will handle the content properly.

  The discussion above establishes that multilingual text in Indian languages may be displayed on a web page by using fonts appropriate to the scripts to be shown. The aksharas of the language displayed through fonts will be built from the glyphs in the font. When a section of the html page is to be displayed in a specific script, the section may be bracketed by the <font face=....> .......</font> tags, and the text in the section would consist of the letters or special symbols corresponding to the glyphs in the font. The text is always stored as 8-bit codes and may include codes in the region 160-255, which lies beyond the 7-bit ASCII range. Given below is an html page containing text in Roman, Sanskrit, Tamil, Telugu and Malayalam, followed by a screen image of its display.



<HTML> 
Multilingual presentations made easy. <br>
Sanskrit <font face="Sanskrit 1.2">s&lt;Sk«tm! </font>Tamil
<font face="iitmtam">êë¨Èª </font><br>
Telugu <font face="Pothana"> ?"lÇgÇ </font>Malayalam
<font face="LTML-Manoj">cnReNcx <br>
</font> 
</HTML> 

image
  One cannot look at the contents of the html page and figure out what aksharas will appear in the display. The <font face="..."> tags will give some idea of the font, but unless the font name reveals the script, it will be difficult to guess the same. With most Indian scripts, there will be glyphs in the 128-255 range of codes, and in the html document these will show up as characters that include special signs, Roman letters with diacritics and so on, making it difficult to get an idea of the display. Yet, this is the best known method to display Indian scripts on web pages.
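
  The bracketing of glyph-coded text by font tags can be sketched in a few lines of Python. This is an illustration only; the helper name `font_span` is an assumption and does not correspond to any tool mentioned on this page. The sketch also shows one practical detail visible in the sample above: glyph codes may coincide with characters such as "<" that have a special meaning in html, and these must be written as entities (here "&lt;") so that the browser does not mistake them for tags.

```python
import html


def font_span(face: str, glyph_text: str) -> str:
    """Wrap glyph-coded text in a <font face=...> pair.

    html.escape() replaces markup-significant characters
    (<, >, &, and quotes) by entities; all other characters,
    including those in the 128-255 range, pass through unchanged.
    """
    return f'<font face="{html.escape(face)}">{html.escape(glyph_text)}</font>'


if __name__ == "__main__":
    # The glyph string is taken from the Sanskrit sample shown above.
    print(font_span("Sanskrit 1.2", "s<Sk\u00abtm!"))
```

  The output still reveals nothing about the aksharas that will be seen; it only guarantees that the glyph codes reach the font intact.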

  It might appear from the discussion above that it is not difficult to prepare html documents for displaying Indian language text. This is indeed the case, except that no simple technique exists for entering the characters corresponding to the glyph codes. The selection of the glyph codes depends on the font used for the script, and one may not easily type in characters whose codes lie in the range 128 and above. Even if this were possible through some means, it will be difficult for the user, who will think in terms of the aksharas and not the glyph codes. While some desktop publishing packages and word processors may allow such codes to be input, the process is unduly cumbersome even for a single script, let alone multilingual text.

  In Roman fonts, a letter is specified through its ASCII code, and the location of its glyph in the font is directly given by the code. Hence when a font is changed, the displayed letter will remain the same and only its appearance will be different. In the case of most western languages, a letter of the alphabet maps directly to a single glyph, specified through a code within the displayable ASCII range (32-126). This makes it easy for html pages to be typed in on a regular qwerty keyboard using a simple editor. However, even for some European languages such as Greek, which employ accented characters, one faces the same difficulty, as the keyboard does not offer a simple way of entering the accented characters. For Indian languages and scripts, a data entry process that relies on glyph codes is unwieldy, as it is too complex and font dependent even for a single language.

  One solution to this problem is to effect the data entry not in terms of glyph codes but through a transliteration mechanism, where each akshara is input through a standardized sequence of keystrokes. The transliterated text would thus consist of displayable ASCII but would be equivalent to the text in the Indian language. Also, one will be able to make out what aksharas will be seen. The only problem with this scheme is that it is not suitable for directly generating the web page, for what is required in the web page is the glyph codes. If some computer program can map the transliterated text into the appropriate glyph codes, then the method will work. In fact, this is the method that has been suggested for html page creation using the ITRANS and JTRANS packages. These popular packages on the web take a text file containing the transliterated text and produce a suitable html output. One may not get much control over the formatting effected by the packages, but that can be managed by directly editing the html file.
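
  The mapping from transliterated text to glyph codes can be sketched as a table lookup with longest-match scanning. The Python snippet below is a minimal illustration under stated assumptions: the table `GLYPH_TABLE`, its tokens and its code values are entirely hypothetical, and the function `to_glyph_codes` is not the algorithm of ITRANS, JTRANS or any other package. A real table depends on the transliteration scheme and on the font.

```python
# Hypothetical table: transliteration token -> glyph codes in an
# imaginary font. The values are made up for illustration only.
GLYPH_TABLE = {
    "ka": [0xA1],
    "kha": [0xA2],
    "ma": [0xB5],
    "k": [0xA1, 0xF0],  # consonant without vowel: base glyph plus a mark
}


def to_glyph_codes(text: str, table=GLYPH_TABLE) -> bytes:
    """Convert transliterated ASCII into a string of 8-bit glyph codes."""
    longest = max(len(token) for token in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that "kha" wins over "k".
        for n in range(min(longest, len(text) - i), 0, -1):
            codes = table.get(text[i:i + n])
            if codes is not None:
                out.extend(codes)
                i += n
                break
        else:
            raise ValueError(f"no mapping for input at position {i}: {text[i:]!r}")
    return bytes(out)


if __name__ == "__main__":
    for word in ("kama", "khama", "kma"):
        print(word, "->", to_glyph_codes(word).hex(" "))
```

  Note that one token may yield more than one glyph code, which is why the user is better off typing aksharas and leaving the glyph selection to a program.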

   A number of Indian language magazines and newspapers appearing on the web use specific fonts for displaying their pages on the web. The web pages are prepared using some language specific and font specific data entry software and may contain Roman text as well.  These magazines  also allow the users to download the required fonts before reading the text.


Using word processors to generate html files for Indian Scripts

    Word processors that support text preparation using user specified fonts may also be used for preparing html pages. These useful packages come with features to convert the document into html format automatically. Some even have html editing features built into them, making the job really easy. The problem of data entry remains, however, since the word processor allows direct data entry only for glyph codes that fall in the displayable ASCII range. If the text in the Indian language is available in a format compatible with the word processor (e.g., rich text format), then the word processor's cut and paste facilities may be effectively used to edit the html document.

  The HTML standard allows one other nice way of specifying the glyphs to be displayed, by giving their specific names or their code values in a prescribed manner. The concept here is known as "entities", where a name is associated with a code value. Instead of the code value, the text contains the name of the entity. In this way virtually any glyph can be displayed by using an entity name such as "&agrave;", which specifies that the named glyph should be displayed (in this case the letter "a" with a grave accent). The HTML standard refers to these as character entities, and most of the glyphs in standard encoded fonts have an entity name. While this method will most certainly work, it is nevertheless painful in practice. The Macintosh, however, refuses to render many entities! So the problem of multilingual displays continues to plague us despite the tricks we may employ to get the rendering right on a specific platform.
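
  The association between entity names and code values can be examined with a short sketch. The Python snippet below uses the standard `html` and `html.entities` modules; the helper `numeric_reference` is a hypothetical name introduced for the example. It shows the code value behind "&agrave;" and how any code may be written in the numeric form "&#NNN;".

```python
import html
from html.entities import name2codepoint


def numeric_reference(ch: str) -> str:
    """Write a single character as a numeric character reference."""
    return f"&#{ord(ch)};"


if __name__ == "__main__":
    # The named entity and the numeric form denote the same code value.
    print(name2codepoint["agrave"])          # 224
    print(html.unescape("&agrave;"))         # the letter a with a grave accent
    print(numeric_reference("\u00e0"))       # &#224;
```

  Writing a whole page of glyph-coded text this way is possible but, as noted above, painful: every code in the 128-255 range becomes a six-character reference.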


Using the IIT Madras multilingual editor to create html documents

   The  Multilingual editor from IIT Madras is a good choice for preparing html files for  display on the web. Using the editor one can directly type in the required html text just as one  would do with a normal editor.  The ability of the editor to handle data entry in Roman along  with the local scripts, comes in handy for this purpose. Shown below is the screen image of  an actual html file being typed in as a .llf file using the editor. Following the screen image is  the html document  obtained by converting the .llf file to html using the llf2html conversion utility given along with the editor. The final image is the display as seen in a web browser. Please observe that Roman text in the document as well as the HTML tags are retained as such but the glyphs for the Sanskrit and Tamil letters are substituted at the appropriate places. 


image


<HTML> 
<P>This is an example of an html file prepared using the IITM multilingual 
editor. The html tags may be typed in easily by switching the input to 
Roman. Normal local scripts may be typed in using the appropriate scripts. 
<PRE> 
<FONT FACE="Times">This section will appear pre formatted. 
</FONT><FONT FACE="Sanskrit 1.2">suSvagtm!&nbsp;&nbsp; </FONT><FONT FACE="iitmtam">åùªôõ² 
</FONT><A HREF="meaning.html"><FONT FACE="Times">&nbsp;meaning of </FONT><FONT FACE="Sanskrit 1.2">suSvagtm!</FONT></A></PRE> 
</HTML> 


image

  The llf2html program gives a straight html file which may be edited further with standard html editors to improve the formatting, if required. At this point, it may be pertinent to ask: can we not use a standard html editor to begin with and select the fonts for Indian scripts? The answer to this is obviously yes, but one should remember that data entry then has to proceed based on glyph information and so can never be natural, as the user thinks of the aksharas. The proper method would be to use the IITM editor to quickly type in the local language text and then format it using the features of the html editor. Formatting usually does not involve further editing of the text, and so keyboard input will be unnecessary. If required, the cut and paste features may be used to effect minor editing of the text.

  It is always meaningful to make the data entry process natural and simple. The formation of complex conjuncts in the Indian scripts is still a formidable task even with the most sophisticated of word processors. Obviously we cannot expect much from programs which are built for one-letter, one-glyph languages! This is where data entry using the IIT Madras software offers definite advantages, as it supports a well defined "one key sequence, one akshara" mapping regardless of the number of glyphs used in building up the akshara.
 



Multilingual displays with Unicode text in html documents

   It would appear that Unicode is the proper solution for web pages, for there would be little confusion as to the encoding and hence the choice of a font. It turns out that this is indeed so. Unfortunately, the rendering of Unicode text in Indian languages is beset with problems where the application is expected to know how a syllable should be rendered. The same Unicode text is known to appear differently on different browsers. Also, there is no general concurrence on how one should go about designing applications that render syllables rather than letters of the alphabet. 
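
  The reason rendering falls to the application can be seen by looking at the codes behind a single conjunct. The Python sketch below is an illustration only; the helper `code_points` is a hypothetical name and the conjunct was chosen as an example. It lists the Unicode codes stored for the Devanagari conjunct "ksha". The text holds three separate codes, and it is left to the browser and the font to combine them into one shape.

```python
import unicodedata


def code_points(text: str) -> list:
    """List (code, name) pairs for every character stored in the text."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in text]


# The conjunct "ksha", written as escapes: KA + VIRAMA + SSA.
KSHA = "\u0915\u094d\u0937"

if __name__ == "__main__":
    print("characters stored:", len(KSHA))
    for code, name in code_points(KSHA):
        print(code, name)
```

  An application that does not know this combining rule may show the three codes as separate shapes, which is one way the same text comes to look different on different browsers.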

  Despite this, Unicode based web pages may show Indian language content in a satisfactory manner, though not in the preferred way. If text on a web page has to be processed further (even copied and pasted into another multilingual application), there will be problems when different applications render the same Unicode text differently (yes!). The page "Vagaries of Unicode rendering" is amusing to view.


Creating e-content in Indian languages

  One of the difficulties faced in disseminating content in Indian languages is the lack of appropriate standards that allow development of content which can be rendered correctly and uniformly across systems. Well established and proven approaches to dealing with ASCII text will not work well, since the ASCII text will contain only shape information and not true linguistic content. Methods based on Unicode suffer from the requirement that rendering is also partly the responsibility of the application (when OpenType fonts are used). We have already seen that the rendering of Unicode text in Indian languages varies considerably between applications. The best alternative in these painful situations is to create the content in a format where rendering is built into the content. This way, special resources and fonts will not be called for.

   One can use the Portable Document Format (PDF) to create files and embed the fonts used. This is a viable approach and is known to work well when the primary goal is to display content and not process it further. Also, the content could be generated on the fly as images (gif, png, jpg formats) and sent to the browser. Images are almost always rendered properly on most browsers. The IITM utility to create PDF documents from llf files may be used effectively for this purpose. The online lessons to learn Sanskrit served from the Acharya site are a good example of the approach.


Browser independent display handling methods

  Browser independence is a complex matter. The HTML standard does not specify the actual resources which must be used in generating the display on the browser but only indicates what is to be done. Browsers often substitute resources that are compatible and locally available when a specified resource (such as a font) is not present in the system. In respect of Indian languages, there are no standards (yet) for text rendering and, worse, one cannot think of standards across scripts. Hence the best approach is to serve the content in a format that most browsers will render correctly. The approach using dynamic fonts may appear convincing, but it is clearly established that the encoding used in the font does not get recognized properly across systems, thus restricting the dynamic font approach to the MS Windows platform. Dynamically generated images offer the best solution.

Using Java

  Java applets can be used with advantage on web pages. Unfortunately, the manner in which Java code is interpreted and run varies considerably due to rapid changes taking place in the Java Runtime support. Java may not allow flexible handling of different fonts. However, there are effective ways in which a specified font can be downloaded and used by an applet. This is a viable alternative, but one must accept the fact that the Java code itself has to be written in such a way that it will correctly execute inside the browser. In other words, a Java applet is not guaranteed to run on each and every browser, even if the Java Runtime Environment is present in the system.


Last updated on 04/11/07