Encoding on the Internet
Before encoding issues can be understood, it's important to understand the design of different types of scripts.
Technically, you can support any language you want on the Internet: just transliterate it into the English alphabet with no accent marks or non-U.S. punctuation, and the language is "supported". Although this is what happens in e-mail and chat rooms all across the world, people would vastly prefer to use the Internet in their native scripts.
However, the needs of data transmission and the designs of different scripts make placing non-English material online a complex problem. Computers were initially designed with English in mind, and still have to be re-engineered to handle other scripts.
An alphabet is a system where each letter represents a single consonant or vowel. Because computers were designed around English, a left-to-right alphabet is relatively easy to encode. Some notable alphabets include:
It is important to note that although some languages are associated with certain defined scripts, the script and the language are not identical. In fact, some "dual languages" like Serbo-Croatian or Hindi-Urdu are really closely related languages which can be written in multiple scripts. "Croatian" is traditionally written in the Roman alphabet, while "Serbian" can be written in either Cyrillic or Roman. Similarly, "Hindi" is written in Devanagari, the writing system associated with Hinduism, while "Urdu" is written in the Arabic script and is associated with Islam. Here, the writing system is associated with a religious difference, so both scripts should be supported.
Several Middle Eastern alphabets, including Arabic and Hebrew, are written right-to-left, so directionality should be specified in the HTML.
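As a rough illustration of specifying directionality, the Python sketch below picks a value for the HTML dir attribute by looking at the first strongly directional character in a string. The function names base_direction and wrap_paragraph are hypothetical, and the "first strong character" rule is a simplifying assumption, not a full implementation of bidirectional text handling.

```python
import unicodedata
from html import escape

# Bidirectional classes that Unicode assigns to right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def base_direction(text: str) -> str:
    """Return "rtl" or "ltr" based on the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return "rtl"
        if bidi == "L":
            return "ltr"
    return "ltr"  # assumed default when no strong character is found


def wrap_paragraph(text: str) -> str:
    """Hypothetical helper: wrap text in a <p> element with an explicit dir attribute."""
    return f'<p dir="{base_direction(text)}">{escape(text)}</p>'


hebrew = "\u05e9\u05dc\u05d5\u05dd"  # Hebrew word "shalom"
print(wrap_paragraph(hebrew))
print(wrap_paragraph("Hello"))
```

Setting dir explicitly on the element, as sketched here, keeps punctuation and mixed-direction runs laid out relative to the intended base direction of the paragraph.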
In addition, most Arabic and some Hebrew letter forms vary depending on whether the letter is at the beginning, middle or end of a word (this was done to make manuscript writing faster). In computing terms, the same letter may need to be displayed in several alternate forms depending on its position in the word.
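The positional variation can be seen in Unicode itself. The sketch below, using Python's standard unicodedata module, lists the four legacy "presentation form" code points for the Arabic letter BEH and checks that compatibility normalization folds each one back to the single abstract letter. The choice of BEH is only an example; text is normally stored with the abstract letter, and the font and rendering software choose the shape.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract letter normally stored in text

# Legacy presentation-form code points for the four positional shapes of BEH.
POSITIONAL_FORMS = {
    "isolated": "\ufe8f",
    "final": "\ufe90",
    "initial": "\ufe91",
    "medial": "\ufe92",
}

for position, form in POSITIONAL_FORMS.items():
    # Compatibility normalization (NFKC) maps each shape back to the base letter.
    folded = unicodedata.normalize("NFKC", form)
    print(position, f"U+{ord(form):04X}", unicodedata.name(form), folded == BEH)
```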
Note: All alphabets originate from an ancient Semitic script developed by Bronze Age turquoise miners working in Egypt. Although the forms were based on hieroglyphics, the sounds are based on Semitic words, not Egyptian.
A syllabary script is one in which one symbol represents a single syllable (consonant-vowel) sequence. These scripts were developed for languages in which most syllables end in a vowel, so the writing of these languages is more compact. Examples of syllabaries include:
Although writing is shorter, more symbols are needed because there are more possible combinations of consonant+vowel. The encoding of syllabaries therefore requires larger fonts and different allocations of memory.
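To give a sense of scale, the sketch below counts the named letters in the Unicode Hiragana block and compares the result with the unaccented Latin letters. Japanese hiragana is used here as an assumed example of a syllabary; the count reflects whatever Unicode version ships with the Python interpreter, and it includes small and voiced variants.

```python
import unicodedata

# Letters in the Hiragana block (U+3040..U+309F); unassigned code points and
# marks are skipped because their names do not start with "HIRAGANA LETTER".
hiragana = [
    chr(cp)
    for cp in range(0x3040, 0x30A0)
    if unicodedata.name(chr(cp), "").startswith("HIRAGANA LETTER")
]

# Unaccented Latin letters A-Z and a-z for comparison.
latin = [chr(cp) for cp in range(0x41, 0x7B) if chr(cp).isalpha()]

print("hiragana letters:", len(hiragana))
print("latin letters:", len(latin))

# Each hiragana letter also needs more bytes than an ASCII letter in UTF-8.
print(len("\u304b".encode("utf-8")), len("k".encode("utf-8")))
```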
Scripts which write consonants as the main letters, but then use special symbols or "vowel marks" to indicate which vowels follow the consonant, are called "syllabic alphabets". These are true alphabets in the sense that consonants and vowels are independently written, but because the vowels and consonants combine to form complex characters, they can be difficult to encode.
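A small Python sketch of this consonant-plus-vowel-mark structure, using the Devanagari syllable "ki" as an assumed example: the text is stored as two code points, a consonant letter followed by a dependent vowel sign, even though a reader sees one combined shape (and this particular vowel sign is drawn to the left of the consonant).

```python
import unicodedata

syllable = "\u0915\u093f"  # Devanagari "ki": consonant KA followed by vowel sign I

for ch in syllable:
    # Print the code point, its Unicode name, and its general category.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# Two code points are stored even though they display as one visual unit.
print(len(syllable))
```

The vowel sign has a "mark" category, not a "letter" category, which is one reason that rendering and cursor movement for these scripts need extra logic.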
Most South Asian and Southeast Asian scripts are syllabic alphabets; these include Hindi (Devanagari), Tamil, Gujarati, Bengali, Thai, Balinese, Hmong, Tibetan and many others. Korean Hangul can also be classified as a syllabic alphabet.
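The alphabetic structure inside a Hangul block can be sketched with Unicode normalization in Python. The syllable "han" is an assumed example: canonical decomposition (NFD) splits the precomposed syllable into its consonant and vowel letters, and canonical composition (NFC) reassembles it.

```python
import unicodedata

syllable = "\ud55c"  # precomposed Hangul syllable "han"

# NFD splits the syllable block into its individual letters (jamo).
jamo = unicodedata.normalize("NFD", syllable)
for ch in jamo:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# NFC recombines the letters into the single precomposed syllable.
recomposed = unicodedata.normalize("NFC", jamo)
print(recomposed == syllable)
```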
In an ideographic script, a character is used to represent one concept, regardless of pronunciation. The Chinese script is considered ideographic, although there are methods to write phonetic pronunciations. The Chinese script is probably the largest in terms of characters, because each concept requires a symbol. Compounding is commonly used to create additional words.
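As a rough indication of size, the sketch below computes how many code points Unicode reserves in its main CJK Unified Ideographs block (U+4E00 to U+9FFF) and shows that one character, here "mountain" as an assumed example, is a single code point whatever its pronunciation. The block is only one of several ideograph ranges in Unicode, so the figure is a lower bound on the encoded repertoire, not a count of characters in everyday use.

```python
import unicodedata

# Main CJK Unified Ideographs block: U+4E00 through U+9FFF inclusive.
CJK_UNIFIED_IDEOGRAPHS = range(0x4E00, 0xA000)
print("code points in block:", len(CJK_UNIFIED_IDEOGRAPHS))

mountain = "\u5c71"  # the character for "mountain"
# One concept, one code point; its Unicode name is derived from the code point.
print(len(mountain), unicodedata.name(mountain))

# In UTF-8 this single character occupies three bytes.
print(len(mountain.encode("utf-8")))
```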
The Roman alphabet does include ideographs for numbers (e.g. 1, 2, 3) and currency symbols, which can be pronounced differently depending on the language in question. Thus the ideograph for 1 can be pronounced as one in English, uno in Spanish, un in French, ein in German and bat in Basque.
©Penn State University, 2000-2013.
This Web page maintained by Teaching and Learning with Technology, a unit of Information Technology Services. For questions or comments on this Web page, please contact Elizabeth J. Pyatt (firstname.lastname@example.org).
Unicode character names and hexadecimal entity codes are taken from the public Unicode Character Charts.
Last Modified: Tuesday, 04-Jun-2013 12:41:28 EDT