Teaching and Learning with Technology

Computing With Accents and Foreign Scripts


Declare the Language


The lang attribute can be used to declare the language of a Web page or of a portion of a Web page. This is meant to assist search engine spiders, page formatting and screen reader technology.

NOTE: You must declare the encoding in addition to the language. The language and its script are independent.
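
As a minimal sketch of declaring both at once, the page below sets the language on the <html> tag and the encoding in a <meta> tag. UTF-8 and the sample title are assumptions chosen for illustration; use whatever encoding the file is actually saved in.

```html
<!DOCTYPE html>
<html lang="fr">
<head>
  <!-- Encoding: how the bytes of the file map to characters -->
  <meta charset="utf-8">
  <title>Exemple</title>
</head>
<body>
  <!-- Language: tells spell checkers and screen readers this text is French -->
  <p>C'est la vie.</p>
</body>
</html>
```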

This Page

  1. Why Language Tags?
  2. HTML Template
  3. Switching Languages
  4. Common Language Codes
  5. Language Codes for Linguistics or Language Archives
  6. Specifying Dialects and Language Varieties
  7. Links to Code Tables

Why Language Tags?

In the online archive world, there are two primary reasons for associating documents with specific languages: to facilitate global technology and to facilitate metadata search in archives. Although both reasons are valid, they are by no means identical. In some situations, one goal may be more important than the other.

Facilitate Global Technology

How do you select the right spell checker to use (French vs. English), the right font (Arabic vs. Urdu), the right way to pronounce c'est la vie (French "Say la vee" vs. English "Sest la v-eye") or the right set of quote marks ("Quote Marks" in English vs. «Quote Marks» in Spanish)?

The answer is to tag documents with a language and to program utilities to behave differently depending on the language identified. This allows a single product (e.g. Microsoft Word) to be used with plug-in spell checkers for different languages.

The caveat is that usually only written languages are targeted by these kinds of utilities. For instance, Microsoft has utilities for standard American English and standard British English, but not for spoken varieties such as Brooklyn English. Although a "Brooklyn" spellchecker and a "Brooklyn" speech synthesizer could be programmed, many native "Brooklyn" speakers would probably find them condescending and not use them.

Facilitate Metadata Search in Archives

Aside from spellcheckers and speech synthesizers, researchers into specific dialects or historical forms need a way to tag their material into very narrow categories that would be irrelevant to most software vendors.

The caveat here is that a tag may be registered, but only supported by a very narrow range of specialized applications. An example of this would be the need for a Celtic database to distinguish Gaulish (xcg) vs. Celtiberian (xce), two distinct ancient Celtic languages. It is unlikely, however, that any speech synthesizer will pronounce words from these languages correctly.

When deciding how to tag documents, it may be important to consider whether you are tagging for general usage or for a narrow research purpose.


HTML Template

The official W3C recommendation is to declare the primary language for each Web page with a lang attribute in the <html> tag. Codes are primarily ISO-639 language codes.

For instance:

Template

<html lang="??">
...
</html>

English (U.S.)

<html lang="en-US">
...
</html>

English (U.K./Great Britain)

<html lang="en-GB">
...
</html>

Spanish

<html lang="es">
...
</html>

XHTML

In XHTML, the language is declared in the <html> tag with both the lang and xml:lang attributes, as follows:

<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">

NOTE: If you are writing in a right-to-left language like Arabic or Hebrew, you should add the dir="rtl" attribute. See the Right Alignment options for more details.
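
A minimal sketch of how the two attributes might be combined for an Arabic page. The sample text and title are placeholders chosen for illustration, and UTF-8 is an assumed encoding.

```html
<!DOCTYPE html>
<html lang="ar" dir="rtl">
<head>
  <meta charset="utf-8">
  <title>مثال</title>
</head>
<body>
  <!-- Inherits lang="ar" and dir="rtl" from the html tag -->
  <p>مرحبا بالعالم</p>
  <!-- An embedded left-to-right phrase can override both attributes -->
  <p><span lang="en" dir="ltr">Hello, world</span></p>
</body>
</html>
```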


Switching Languages

If you switch languages within one page, you can add the lang attribute to other tags such as <p>, <h1> and <span>. For example:

Foreign Language Test Text

This sentence is in English.

This sentence will be read with a British accent.

Esta frase es en español. (Spanish)

Cette phrase est en français. (French)

Mae'r frawddeg hon yn Cymraeg. (Welsh)

Code

<p>This sentence is in English.</p>

<p lang="en-GB">This sentence will be read with a British accent.</p>

<p><span lang="es">Esta frase es en espa&ntilde;ol.</span> (Spanish)</p>

<p><span lang="fr">Cette phrase est en fran&ccedil;ais.</span> (French)</p>

<p><span lang="cy">Mae'r frawddeg hon yn Cymraeg.</span> (Welsh)</p>


Some Common Language Codes

Language codes are primarily taken from the list of ISO-639 language codes. Some common codes, including those for all the languages taught at Penn State, are listed below. For the most part, they are based on the native name of the language (e.g. "es" for Spanish, from Español).

This language code list has recently been expanded from an older two-letter set to a three-letter set (e.g. "eng" for English). Therefore, some languages (particularly ancient languages) may have only a three-letter code listed.

The By Language pages list the codes for each language, but common codes are listed below.

Commonly Taught Languages

Other Codes

These are codes where the language name diverges significantly from English.
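
A few well-known codes of this kind, shown as they would appear in markup. This is an illustrative selection under the assumption that native names are the point of interest, not a reproduction of the full list.

```html
<!-- Each code abbreviates the native name of the language -->
<ul>
  <li lang="de">Deutsch</li>    <!-- German -->
  <li lang="es">Espa&ntilde;ol</li>  <!-- Spanish -->
  <li lang="cy">Cymraeg</li>    <!-- Welsh -->
  <li lang="sv">svenska</li>    <!-- Swedish -->
  <li lang="hr">hrvatski</li>   <!-- Croatian -->
  <li lang="sq">shqip</li>      <!-- Albanian -->
  <li lang="eu">euskara</li>    <!-- Basque -->
</ul>
```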

Note on Screen Reader Support: Only the most recent versions of JAWS and Home Page Reader support the lang attribute for French, Spanish, Portuguese, German and Finnish. To support other languages, it is recommended that users install plug-ins or screen reader software designed for those languages.


Language Codes for Linguistics or Language Archives

If the code you need is not listed on the ISO-639 language page, then refer to the larger SIL ISO-639-3 list. This list was released in 2007 and includes many more languages than previous lists.

In some situations, though (e.g. the languages of China or different "varieties" of German), you may need to use older codes so that they are recognized by more software packages.
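
The trade-off can be sketched as follows. The codes shown (zh, cmn, yue, de, gsw) are real ISO-639 codes, but the sample text is only illustrative, and how any particular software package handles the narrower codes is an assumption to verify.

```html
<!-- Older, widely recognized code: "zh" covers Chinese as a whole -->
<p lang="zh">中文</p>

<!-- Narrower ISO-639-3 codes; software support may be limited -->
<p lang="cmn">普通话</p>  <!-- Mandarin -->
<p lang="yue">廣東話</p>  <!-- Cantonese -->

<!-- German: "de" is broadly supported; "gsw" (Swiss German) less so -->
<p lang="de">Deutsch</p>
<p lang="gsw">Schwiizerd&uuml;tsch</p>
```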


Specifying Language Dialects and Varieties

Language codes can be followed by an optional variety code, but note that not all codes are recognized by all vendors and that the line between "language" and "dialect" can be very fuzzy in some situations.

By Country

Until recently, the only way most vendors (e.g. Microsoft or Apple) distinguished varieties of a language was by attaching an ISO-3166 country code after the language code. Although some "country codes" can be linguistically inaccurate, they may be the most standardized.
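
For illustration, a language code followed by an ISO-3166 country code might be used like this (the sample text is a placeholder):

```html
<!-- Pattern: language code, hyphen, two-letter country code -->
<p lang="pt-BR">Portugu&ecirc;s do Brasil</p>    <!-- Portuguese as used in Brazil -->
<p lang="pt-PT">Portugu&ecirc;s de Portugal</p>  <!-- Portuguese as used in Portugal -->
<p lang="fr-CA">Fran&ccedil;ais canadien</p>      <!-- French as used in Canada -->
<p lang="es-MX">Espa&ntilde;ol de M&eacute;xico</p>      <!-- Spanish as used in Mexico -->
```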

RFC 4646 Tag Syntax

Recently, there has been an attempt to codify other types of regional varieties as part of RFC 4646 (since superseded by RFC 5646; both are published as BCP 47), but it is still a work in progress. Below are some guidelines for forming different types of varieties, but note that not all of them may be registered.

Check Registry First: Before using any subtag, confirm that it has been registered in the IANA Language Subtag Registry. Otherwise, assume it is a tag that only you may be using.

By Script

If a language can be written in more than one script, then you may need to specify which script is in use. Some of these script distinctions are implemented in modern software systems such as Windows Vista. Common examples (all of which are registered) include:

Many languages written in multiple scripts have IANA-registered script variants, but not all of them do. If a script variant for your language does not exist, then the following script subtags can be used.
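
As an illustration of script subtags in markup, the sketch below uses four-letter ISO 15924 script codes (Hans, Hant, Cyrl, Latn) placed after the language code. The sample text is a placeholder.

```html
<!-- Pattern: language code, hyphen, four-letter script code -->
<p lang="zh-Hans">简体中文</p>  <!-- Chinese, Simplified script -->
<p lang="zh-Hant">繁體中文</p>  <!-- Chinese, Traditional script -->
<p lang="sr-Cyrl">српски</p>    <!-- Serbian, Cyrillic script -->
<p lang="sr-Latn">srpski</p>    <!-- Serbian, Latin script -->
```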

By Numeric World Region

If a regional variety is larger than a country, then it is recommended that region codes from the U.N. Numeric Macroregions List be used. The most prominent example is probably:

Another theoretical example could be en-021 (American and Canadian English), where 021 is the U.N. code for Northern America, although this combination is unlikely to be recognized by most software.
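
A widely used tag of this kind is es-419, where 419 is the U.N. numeric code for Latin America and the Caribbean. A minimal sketch, with placeholder text:

```html
<!-- Pattern: language code, hyphen, three-digit U.N. region code -->
<p lang="es-419">Espa&ntilde;ol de Am&eacute;rica Latina</p>
```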

Unregistered Codes

If you need a code not registered with the IANA, you can create new codes following suggested guidelines, but you may need to add an x- prefix to indicate that it is unregistered.

Anyone can request a new variant code, but the process is lengthy.

Dialects within a Country

RFC 4646 permits codes to be combined. So if you need to specify the Baltimore dialect of English, you could create a code such as en-US-Baltimore.

Please note that no regional varieties from the United States are registered with the IANA (and only three from Britain).

Thus you can either use the code x-en-US-Baltimore to indicate it is not registered or just en-US-Baltimore depending on your needs. It is very likely most software packages would interpret the string as just en-US.
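
If strict conformance to the RFC 4646 syntax matters, note that each subtag is limited to eight characters and that the private-use marker x goes after the registered subtags. A sketch of a tag that follows those two rules is shown below; the shortened subtag "baltimor" is a hypothetical name invented for this example, not a registered or conventional one.

```html
<!-- Registered subtags first (en, US), then "x", then private-use
     subtags of at most eight characters each ("baltimor" is hypothetical) -->
<p lang="en-US-x-baltimor">Welcome to Baltimore.</p>
```

Software that does not understand the private-use part would most likely still fall back to en-US.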

By Era

The RFC 4646 does not specify how to indicate time within a particular language, but some registered codes indicate dates for when spelling changes in a language were enacted. Some examples include:
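
Two registered variants of this kind are 1901 and 1996, which mark the traditional and the reformed German orthography. A minimal sketch, using one word whose spelling changed with the reform:

```html
<p lang="de-1901">da&szlig;</p>  <!-- traditional German orthography -->
<p lang="de-1996">dass</p>       <!-- German orthography of 1996 -->
```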


Links

ISO-639 Two-Letter Language Codes

Use these codes if they are available for your language.

ISO-639-3 Codes for Linguists and Archivists

Use these codes if you cannot find an appropriate code for your language in the lists above.


Last Modified: Tuesday, 04-Jun-2013 12:41:34 EDT