Teaching and Learning with Technology

Computing With Accents and Foreign Scripts


Encoding on the Internet

0: Types of Scripts


Before encoding issues can be understood, it's important to understand the design of different types of scripts.

Script Support vs. Language Support

Technically, you can support any language you want on the Internet - just transliterate it into an English alphabet with no accent marks or non-U.S. punctuation and the language is "supported". Although this is what happens in e-mail and chat-rooms all across the world, people would vastly prefer to be able to use the Internet in their native scripts.
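As a rough illustration of this kind of transliteration, the sketch below strips accent marks using Python's standard unicodedata module. The function name to_plain_ascii is a hypothetical helper, not part of any library, and the approach is only one simple way to do this.

```python
import unicodedata

def to_plain_ascii(text):
    """Hypothetical helper: reduce text to unaccented ASCII letters."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops the combining marks, along with
    # any character that has no ASCII base letter at all.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_plain_ascii("café señor Zürich"))  # prints: cafe senor Zurich
```

Note that text in a non-Roman script (for example Cyrillic) disappears entirely under this approach, which is one reason people prefer real support for their native scripts.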

However, the needs of data transmission and the designs of different scripts make placing non-English material online a complex problem. Computers were initially designed with English in mind, and still have to be re-engineered to handle other scripts.


Alphabets (Left-To-Right)

An alphabet is a system where each letter represents a single consonant or vowel. Because early computers were designed around English, a left-to-right alphabet is relatively easy to encode. Some notable alphabets include:

  1. Roman Alphabet - The alphabet used to write English is actually the Roman or Latin alphabet. It was developed by the Romans based on Greek and Etruscan forms, spread throughout Europe during the Roman Empire, and was then carried to other parts of the world by various European powers.
  2. Cyrillic Alphabet - The alphabet used for Russian, which combines elements from the Greek and Roman alphabets. Cyrillic is used to write many languages of the former Soviet Union, including Ukrainian, Belarusian, Uzbek, and more.
  3. Greek Alphabet - First developed by the Ancient Greeks.
  4. Other Alphabets - Armenian, Georgian, Runic, Coptic, Somali, Phonetic Symbols, Ogam, Braille, others

Scripts are Not Languages

It is important to note that although some languages are associated with certain defined scripts, the script and the language are not identical. In fact, some "dual languages" like Serbo-Croatian or Hindi-Urdu are really closely related languages which can be written in multiple scripts. "Croatian" is traditionally written in the Roman alphabet, while "Serbian" can be written in either Cyrillic or Roman. Similarly, "Hindi" is written in Devanagari, the writing system associated with Hinduism, while "Urdu" is written in the Arabic script and is associated with Islam. Here, the choice of writing system reflects a religious difference, so both scripts should be supported.
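To make the script/language distinction concrete, here is a minimal sketch that guesses which scripts appear in a string. It relies on a heuristic: the first word of each letter's Unicode character name (as returned by Python's unicodedata.name) usually names its script. The function scripts_used is a hypothetical helper, and Python's standard library does not expose the official Unicode Script property, so treat this as an approximation.

```python
import unicodedata

def scripts_used(text):
    """Hypothetical helper: guess scripts from Unicode character names."""
    found = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER ES" -> "CYRILLIC"
            found.add(unicodedata.name(ch).split()[0])
    return found

# The same word for "Serbian", written in two different scripts.
print(scripts_used("srpski"))   # {'LATIN'}
print(scripts_used("српски"))   # {'CYRILLIC'}
```

The same language name yields two different scripts, so software cannot infer the script from the language alone.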


Right-to-Left Alphabets

Several Middle Eastern alphabets, including Arabic and Hebrew, are written right-to-left, so directionality should be specified in the HTML.
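One way to decide which direction to declare is to look at the first strongly directional character in the text. The sketch below does this with unicodedata.bidirectional and emits an HTML paragraph with a dir attribute. The helper names base_direction and wrap_paragraph are hypothetical, and the "first strong character" rule is a simplification of the full Unicode bidirectional algorithm.

```python
import html
import unicodedata

def base_direction(text):
    """Hypothetical helper: 'rtl' or 'ltr' from the first strong character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):   # Hebrew-type and Arabic-type letters
            return "rtl"
        if bidi == "L":           # left-to-right letters
            return "ltr"
    return "ltr"                  # assumed default when no strong character

def wrap_paragraph(text):
    """Wrap text in a <p> element whose dir attribute matches the text."""
    return f'<p dir="{base_direction(text)}">{html.escape(text)}</p>'

print(wrap_paragraph("שלום"))   # <p dir="rtl">שלום</p>
print(wrap_paragraph("hello"))  # <p dir="ltr">hello</p>
```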

In addition, most Arabic and some Hebrew letter forms vary depending on whether a consonant is at the beginning, middle or the end of a word (this was done to make manuscript writing faster). In computing terms, the same letter may need to be displayed in several alternate formats depending on its position in the word.
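Unicode happens to include legacy "presentation form" code points that make these positional variants visible. The sketch below lists the four forms of the Arabic letter beh and shows that compatibility normalization (NFKC) maps each of them back to the single base letter. In current practice text is normally stored as the base letter, and the font and rendering software choose the shape; the presentation forms are used here only for illustration.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the form normally stored in text

# Legacy presentation forms: isolated, final, initial, medial.
forms = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

for form in forms:
    name = unicodedata.name(form)
    base = unicodedata.normalize("NFKC", form)
    print(f"U+{ord(form):04X} {name} -> U+{ord(base):04X}")
```

Each line of output ends in U+0628, showing that the four displayed shapes are one letter as far as the encoding is concerned.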

Note: Most alphabets in use today are thought to descend from an ancient Semitic script developed by Bronze Age turquoise miners working in Egypt. Although the letter forms were based on Egyptian hieroglyphs, the sound values are based on Semitic words, not Egyptian ones.


Syllabary

A syllabary script is one in which one symbol represents a single syllable (consonant-vowel) sequence. These scripts were developed for languages in which most syllables end in a vowel, so the writing of these languages is more compact. Examples of syllabaries include:

  1. Japanese Katakana & Hiragana - Two parallel syllabaries used in Japanese to specify the pronunciation of Chinese and non-Japanese words. Katakana is the more angular system, similar in appearance to Chinese, while Hiragana is the more rounded system associated with "women's writing." The Japanese writing system also incorporates Chinese characters.
  2. Cherokee - Developed by the Cherokee scholar Sequoyah as a "native writing" system. The forms are based on the Roman alphabet, but the values are not related to the English values. Another syllabary, the Ojibwa or Canadian Aboriginal syllabary, was also invented, and variants have been used for Ojibwa, Cree and Blackfoot. In the Ojibwa syllabary, the orientation of the letter (up/down/left/right) indicates which vowel comes after it.
  3. Cuneiform - The script used on clay tablets for Sumerian, Babylonian, Akkadian and Hittite. A cuneiform alphabet, modeled on the Sinai Semitic alphabet, was used to write the Semitic language Ugaritic, so the visual form of a script does not always indicate its type.
  4. Linear B - The script used to write the oldest Greek documents, found at Mycenae. Later Greeks adopted the Phoenician (Semitic) alphabet, which evolved into the Greek alphabet.
  5. Other Syllabaries - Other societies have independently developed syllabary scripts, but many have been replaced by the Roman alphabet or some other script.

Although writing is shorter, more symbols are needed because there are more possible combinations of consonant+vowel. The encoding of syllabaries therefore requires larger fonts and different allocations of memory.
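The trade-off can be sketched with a little arithmetic. The consonant and vowel counts below are hypothetical, chosen only to show how the number of symbols grows; the second half compares the word "kana" written in Hiragana and in Roman letters, using UTF-8 as one example encoding.

```python
def symbols_needed(consonants, vowels):
    """Return (alphabet size, syllabary size) for a hypothetical language."""
    alphabet = consonants + vowels
    # One symbol per consonant+vowel pair, plus one per bare vowel.
    syllabary = consonants * vowels + vowels
    return alphabet, syllabary

print(symbols_needed(9, 5))  # (14, 50)

kana = "かな"      # two syllabic symbols: ka, na
roman = "kana"     # four alphabetic letters
print(len(kana), len(roman))                                  # 2 4
print(len(kana.encode("utf-8")), len(roman.encode("utf-8")))  # 6 4
```

The syllabic spelling uses half as many symbols, but each symbol takes three bytes in UTF-8 instead of one.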


Syllabic Alphabet

Scripts which write consonants as the main letters, but then use special symbols or "vowel marks" to indicate which vowels follow the consonant, are called "syllabic alphabets." These are true alphabets in the sense that consonants and vowels are written independently, but because the vowels and consonants combine to form complex characters, they can be difficult to encode.

Most South Asian and Southeast Asian scripts are syllabic alphabets; these include Hindi (Devanagari), Tamil, Gujarati, Bengali, Thai, Balinese, Hmong, Tibetan and many others. Korean Hangul can also be classified as a syllabic alphabet.
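The sketch below shows how such combined characters look at the encoding level in Unicode. The Devanagari syllable "ki" is stored as a consonant letter followed by a vowel sign, and a Korean Hangul syllable can be decomposed into its individual letters with NFD normalization. The helper describe is hypothetical and simply lists code points with their Unicode names.

```python
import unicodedata

def describe(text):
    """Hypothetical helper: list (code point, Unicode name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Devanagari "ki": consonant KA followed by the vowel sign I.
ki = "\u0915\u093F"
for code, name in describe(ki):
    print(code, name)

# Hangul syllable U+D55C ("han") decomposes into three letters (jamo).
han = unicodedata.normalize("NFD", "\uD55C")
for code, name in describe(han):
    print(code, name)
```

In both cases what the reader sees as one block on screen is made of several independently encoded consonant and vowel units.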


Ideographic Script

In an ideographic script, a character is used to represent one concept, regardless of pronunciation. The Chinese script is considered ideographic, although there are methods to write phonetic pronunciations. The Chinese script is probably the largest in terms of characters, because each concept requires a symbol. Compounding is commonly used to create additional words.

The Roman alphabet does include ideographs for numbers (e.g. 1,2,3) and currency symbols which can be pronounced differently depending on the language in question. Thus the ideograph for 1 can be pronounced as one in English, uno in Spanish, un in French, ein in German and bat in Basque.
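In encoding terms, this means one stored symbol and many possible readings. The toy sketch below uses the pronunciations listed above; the table READINGS and the function read_aloud are hypothetical names for illustration, and the language codes are ISO 639-1 codes.

```python
# One encoded symbol, several language-specific readings.
READINGS = {
    "1": {"en": "one", "es": "uno", "fr": "un", "de": "ein", "eu": "bat"},
}

def read_aloud(symbol, language):
    """Hypothetical helper: look up how a symbol is read in a language."""
    return READINGS[symbol][language]

print(ord("1"))                 # 49: the same code point in every language
print(read_aloud("1", "es"))    # uno
print(read_aloud("1", "eu"))    # bat
```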




Last Modified: Tuesday, 04-Jun-2013 12:41:28 EDT