Tamil All Character Encoding


Tamil All Character Encoding (TACE16) is a scheme for encoding the Tamil script in the Private Use Area of Unicode, implementing a syllabary-based character model differing from the modified-ISCII model used by Unicode's existing Tamil implementation.

The keyboard driver for this encoding scheme is available free of charge on the Tamil Virtual Academy website. It uses the Tamil 99 and Tamil Typewriter keyboard layouts, which are approved by the Government of Tamil Nadu, and maps input keystrokes to the corresponding characters of the TACE16 scheme. Unicode Tamil fonts for reading files created using TACE16 are available on the same website. These fonts contain glyphs not only for the TACE16 characters but also for the ASCII and Tamil characters in their standard Unicode blocks, which provides backward compatibility with existing files created using the Unicode Tamil block.

All the characters of this encoding scheme are located in the private use area of the Basic Multilingual Plane of Unicode's Universal Coded Character Set.

The existing Unicode character model for Tamil is, like most of Indic Unicode, an abugida-based model derived from ISCII. It has been criticized for several reasons.

Unicode represents only 31 Tamil base characters as single code points, out of 247 grapheme clusters. These include the stand-alone vowels and 23 basic consonant glyphs (which, when used on their own without a virama, denote a syllable containing both a consonant and a vowel). The others are represented as sequences of code points, which require software support for advanced typography features (such as Apple Advanced Typography, Graphite, or OpenType advanced typography) to render correctly. This also requires the use of invisible zero-width joiner and zero-width non-joiner characters in places where the desired grapheme cluster would otherwise be ambiguous. This complexity can result in security vulnerabilities and ambiguous combinations, can require an exception table to forbid invalid combinations of code points, and can necessitate string normalization to compare two strings for equality.
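The normalization point can be illustrated with a minimal Python sketch using only the standard `unicodedata` module. It builds the syllable "ko" in two ways that render identically: once with the single vowel sign O (U+0BCA) and once with its canonically equivalent two-part form (U+0BC6 followed by U+0BBE). The helper name `same_text` is hypothetical and chosen for illustration only.

```python
import unicodedata

# Tamil syllable "ko": the consonant KA followed by the vowel sign O.
ka = "\u0b95"                    # TAMIL LETTER KA
ko_single = ka + "\u0bca"        # KA + TAMIL VOWEL SIGN O
ko_split = ka + "\u0bc6\u0bbe"   # KA + VOWEL SIGN E + VOWEL SIGN AA (canonically equivalent)


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(ko_single), len(ko_split))   # 2 3
print(ko_single == ko_split)           # False: the raw code point sequences differ
print(same_text(ko_single, ko_split))  # True: equal once both are normalized
```

A naive comparison treats the two spellings as different text, which is why software handling Unicode Tamil generally has to normalize before comparing or searching.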

Additionally, since syllables with both a consonant and a vowel form 64 to 70% of Tamil text, an abugida-based model which encodes the consonant and vowel parts as separate code points is inefficient, in terms of how long a string needs to be to contain a given piece of text, in comparison with a syllabary-based model.
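As a rough illustration of the length argument, the following Python sketch counts the code points in the word "Tamil" as written in the Unicode Tamil block and compares that with an approximate count of grapheme clusters. The helper `count_clusters` is a hypothetical simplification (it treats every non-mark character as the start of a cluster and does not handle joiner characters); it is not a full grapheme segmentation algorithm. The assumption that a syllabary-based model would spend one code point per cluster follows from the description above rather than from any TACE16 code chart.

```python
import unicodedata


def count_clusters(text: str) -> int:
    """Approximate cluster count: each character that is not a combining
    mark (general category M*) is treated as starting a new cluster."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


# "Tamil": TA, MA, VOWEL SIGN I, LLLA, VIRAMA
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

code_points = len(word)          # storage cost in the abugida-based model
clusters = count_clusters(word)  # assumed cost in a one-code-point-per-syllable model

print(code_points, clusters)     # 5 3
```

Here three written units take five code points, because the vowel sign and the virama are each encoded separately from their consonants.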

Furthermore, ISCII is primarily an encoding of Devanagari, and the ISCII encodings of other Brahmic scripts (including Tamil) encode characters over the code points of the corresponding characters in Devanagari ISCII. Although Unicode encodes the Brahmic scripts separately from one another, the Tamil block mirrors the ISCII layout (with Devanagari-style character ordering, and reserved space in positions corresponding to Devanagari characters with no Tamil equivalent). Consequently, the characters are not in their natural sequence order, and strings collated by code point (analogous to "ASCIIbetical" sorting of English text) will not come out in the expected sorting order. Arranging them in the natural order requires a complex collation algorithm.
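The ordering problem can be seen with a small Python sketch. It sorts four consonants by raw code point and then by a hand-written table of the 18 native consonants in what is assumed here to be the traditional alphabet order (ending in LLLA, LLA, RRA, NNNA). The table `TRADITIONAL_ORDER` and the key function `traditional_key` are hypothetical illustrations that handle single consonants only; a real collation algorithm must also deal with vowels, vowel signs, and multi-code-point syllables.

```python
# Assumed traditional order of the 18 native Tamil consonants, by code point:
# KA NGA CA NYA TTA NNA TA NA PA MA YA RA LA VA LLLA LLA RRA NNNA
TRADITIONAL_ORDER = [
    "\u0b95", "\u0b99", "\u0b9a", "\u0b9e", "\u0b9f", "\u0ba3",
    "\u0ba4", "\u0ba8", "\u0baa", "\u0bae", "\u0baf", "\u0bb0",
    "\u0bb2", "\u0bb5", "\u0bb4", "\u0bb3", "\u0bb1", "\u0ba9",
]
RANK = {ch: i for i, ch in enumerate(TRADITIONAL_ORDER)}


def traditional_key(ch: str) -> int:
    """Hypothetical sort key: position of a single consonant in the table above."""
    return RANK[ch]


letters = ["\u0baa", "\u0ba9", "\u0bb2", "\u0bb1"]  # PA, NNNA, LA, RRA

by_code_point = sorted(letters)                      # default: raw code point order
by_tradition = sorted(letters, key=traditional_key)  # table-driven order

print([f"U+{ord(c):04X}" for c in by_code_point])  # ['U+0BA9', 'U+0BAA', 'U+0BB1', 'U+0BB2']
print([f"U+{ord(c):04X}" for c in by_tradition])   # ['U+0BAA', 'U+0BB2', 'U+0BB1', 'U+0BA9']
```

In code point order NNNA sorts before PA and RRA before LA, whereas the table-driven order places both of them near the end of the alphabet.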

The following data provides a comparison of current Unicode Tamil vs. TACE16 on e-governance and browsing:[better source needed]
