The lang attribute declares the language of a Web page or of a portion of a Web page. It is meant to assist search engine spiders, page formatting and screen reader technology.
NOTE: You must declare the encoding in addition to the language. The language and its script are independent.
In the online archive world, there are two primary reasons for associating documents with specific languages: to facilitate global technology and to facilitate metadata search in archives. Although both reasons are valid, they are by no means identical. In some situations, one goal may be more important than the other.
How do you select the right spell checker to use (French vs. English), the right font (Arabic vs. Urdu), the right way to pronounce c'est la vie (French "Say la vee" vs. English "Sest la v-eye"), or the right set of quotation marks ("Quote Marks" in English vs. «Quote Marks» in Spanish)?
You tag documents with a language, then program utilities to behave differently depending on the language identified. This allows the same product (e.g. Microsoft Word) to be used for every language, with plugin spell checkers supplied for each one.
The caveat is that only written languages are usually targeted for these kinds of utilities. For instance, Microsoft has utilities for standard American English and standard British English, but not for spoken varieties such as Brooklyn English. Although a "Brooklyn" spellchecker and "Brooklyn" speech synthesizer could be programmed, many "Brooklyn" native speakers would probably find them condescending and not use them.
Aside from spellcheckers and speech synthesizers, researchers into specific dialects or historical forms need a way to tag their material into very narrow categories that would be irrelevant to most software vendors.
The caveat here is that a tag may be registered, but only supported by a very narrow range of specialized applications. An example of this would be the need for a Celtic database to distinguish Gaulish (xcg) vs. Celtiberian (xce) - two distinct ancient Celtic languages. On the other hand, it is unlikely that any speech synthesizer will pronounce words from these languages correctly.
When deciding how to tag documents, it may be important to consider whether you are tagging for general usage or for a narrow research purpose.
In XHTML, the language is declared on the root html element as follows:
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
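Because the encoding must be declared separately from the language, a fuller page skeleton may help. The following is a minimal sketch only: the UTF-8 encoding, the XHTML 1.0 Strict doctype, the title and the sample sentence are assumptions chosen for illustration, not requirements of this page.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- Language: declared on the root element with both lang and xml:lang -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
  <!-- Encoding: declared separately from the language (UTF-8 is assumed here) -->
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>Sample page</title>
</head>
<body>
  <p>This page is in English and encoded as UTF-8.</p>
</body>
</html>
```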
NOTE: If you are writing in a right-to-left language like Arabic or Hebrew, you should add the dir="rtl" attribute. See the Right Alignment options for more details.
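As an illustration of the note above, the sketch below sets the direction for a whole page and, alternatively, for a single paragraph inside a left-to-right page. The sample words are simply the language names and are only placeholders.

```html
<!-- Whole page in Arabic: language and direction on the root element -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="ar" xml:lang="ar" dir="rtl">

<!-- Or: single right-to-left paragraphs inside an English page -->
<p lang="ar" dir="rtl">العربية</p>
<p lang="he" dir="rtl">עברית</p>
```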
If you switch languages within one page, you can add the lang attribute to other tags such as <p>, <h1> and <span>. For example:
This sentence is in English.
This sentence will be read with a British accent.
Esta frase es en español. (Spanish)
Cette phrase est en français. (French)
Mae'r frawddeg hon yn Cymraeg. (Welsh)
<p>This sentence is in English.</p>
<p lang="en-GB">This sentence will be read with a British accent.</p>
<p lang="es">Esta frase es en español.</p> <!-- Spanish -->
<p lang="fr">Cette phrase est en français.</p> <!-- French -->
<p lang="cy">Mae'r frawddeg hon yn Cymraeg.</p> <!-- Welsh -->
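The same attribute can mark a short phrase inside a sentence rather than a whole paragraph. As a small sketch (the sentence itself is invented for illustration), a span can wrap just the foreign words so that a screen reader may switch pronunciation only for that phrase:

```html
<!-- Only the phrase inside the span is tagged as French -->
<p>The phrase <span lang="fr">c'est la vie</span> is often used in English.</p>
```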
Language codes are primarily taken from the list of ISO 639 language codes. Some common codes, including those for all the languages taught at Penn State, are listed below. For the most part, they are based on the native name (e.g. Español (es) for Spanish).
This language code list has recently been expanded from an older two-letter set to a three-letter set (e.g. "eng" for English). Therefore, some languages (particularly ancient languages) may have only a three-letter code listed.
The By Language pages list the codes for each language, but common codes are listed below.
These are codes where the language name diverges significantly from English.
Note on Screen Reader Support: Only the most recent versions of JAWS and Home Page Reader support the lang attribute for French, Spanish, Portuguese, German and Finnish. To support other languages, it is recommended that users install plug-ins or screen reader software designed for those languages.
For some situations though (e.g. China, different "varieties" of German), you may need to use older codes if you need them to be recognized by more software packages.
Language codes can be followed by an optional variety code, but note that not all codes are recognized by all vendors and that the line between "language" and "dialect" can be very fuzzy in some situations.
Until recently, the only way most vendors (e.g. Microsoft or Apple) distinguished varieties of a language was by attaching an ISO 3166 country code after the language code. Although some "country codes" can be linguistically inaccurate, they may be the most standardized.
Recently, there has been an attempt to codify other types of regional varieties as part of the RFC 4646 project, but it is still a work in progress. Below are some guidelines for forming different types of varieties, but note that not all of them may be registered.
Check Registry First: Before using any subtag, confirm that it has been registered first in the IANA Language Subtag Registry. Otherwise assume it is a tag only you may be using.
If a language can be written in more than one script, then you may need to specify which script is in use. Some of these script distinctions are implemented in modern software systems such as Windows Vista. Common examples (all of which are registered) include:
Many languages written in multiple scripts have an IANA-registered variant, but not all of them do. If a script variant does not exist for your language, then the following script subtags can be used.
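As a sketch of how script subtags look in markup, the fragment below uses four-letter ISO 15924 script codes (Hans, Hant, Cyrl, Latn) after the language code. The sample text is just the name of the language or script, and you should still confirm each subtag in the IANA Language Subtag Registry before relying on it.

```html
<!-- Chinese in Simplified vs. Traditional characters -->
<p lang="zh-Hans">简体中文</p>
<p lang="zh-Hant">繁體中文</p>

<!-- Serbian in Cyrillic vs. Latin script -->
<p lang="sr-Cyrl">Српски</p>
<p lang="sr-Latn">Srpski</p>
```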
If a regional variety is larger than a country, then it is recommended that region codes from the U.N. Numeric Macroregions List be used. The most prominent example is probably:
Another theoretical example could be en-021 (American and Canadian English), although this variant is NOT registered.
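A region larger than a country is written the same way as a country code, but with a three-digit U.N. numeric code. As a hedged sketch, the fragment below tags a paragraph as Latin American Spanish using 419, the U.N. code for Latin America and the Caribbean; the sample text is a placeholder.

```html
<!-- es = Spanish, 419 = Latin America and the Caribbean (U.N. numeric code) -->
<p lang="es-419">Español de América Latina</p>
```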
If you need a code not registered with the IANA, you can create new codes following suggested guidelines, but you may need to add an x- prefix to indicate that it is unregistered.
By the way, anyone can request a new variant code, but the process is lengthy.
RFC 4646 permits codes to be combined. So if you need to specify the Baltimore dialect of English, you could create a code such as en-US-Baltimore.
Please note that no regional varieties from the United States are registered with the IANA (and only three from Britain).
Thus you can either use the code x-en-US-Baltimore to indicate it is not registered or just en-US-Baltimore depending on your needs. It is very likely most software packages would interpret the string as just en-US.
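The fragment below sketches these options in markup. One caution, stated as a reading of RFC 4646 rather than of this page: the RFC's syntax limits each subtag to eight characters and places private-use subtags after an x singleton, so the third line shows a hypothetical shortened form (baltimor is an invented subtag, not a registered one).

```html
<!-- Forms discussed above; most software would likely fall back to en-US or ignore them -->
<p lang="en-US-Baltimore">Welcome to Baltimore.</p>
<p lang="x-en-US-Baltimore">Welcome to Baltimore.</p>

<!-- Hypothetical alternative: private-use subtag after x-, at most eight characters -->
<p lang="en-US-x-baltimor">Welcome to Baltimore.</p>
```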
RFC 4646 does not specify how to indicate time within a particular language, but some registered codes indicate the dates when spelling changes in a language were enacted. Some examples include:
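As one illustration, German has registered variant subtags for the traditional orthography (1901) and the reformed orthography (1996). The sketch below tags the same word in its pre-reform and post-reform spellings; whether a given spell checker honors these subtags is not guaranteed.

```html
<!-- Traditional German orthography -->
<p lang="de-1901">daß</p>

<!-- German orthography after the 1996 reform -->
<p lang="de-1996">dass</p>
```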
Use these codes if available for your language
Use these codes if you cannot find an appropriate code for your language in the lists above.
©Penn State University, 2000-2013.
This Web page maintained by Teaching and Learning with Technology, a unit of Information Technology Services. For questions or comments on this Web page, please contact Elizabeth J. Pyatt (firstname.lastname@example.org).
Last Modified: Tuesday, 04-Jun-2013 12:41:34 EDT