Teaching and Learning with Technology

Computing With Accents and Foreign Scripts

Raw XML Files

Note: This page covers XML files other than XHTML. Examples include RSS, XML files for Flash and other XML standards.

No Entity Codes

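Although the XML standard supports Unicode, it does not define HTML entity codes like &eacute; for accented é. Only five named entities (&lt;, &gt;, &amp;, &apos; and &quot;) are predefined; any other named entity must be declared in a DTD. Numeric character references like &#961; for Greek ρ are part of the XML standard, but they make the source hard to read, so entering the characters directly is usually the better choice.

The recommended approach is therefore to use keyboard utilities to enter the Unicode characters raw. Here's how a line of RSS XML code might look for a Spanish name. The next section discusses how to create a Unicode XML file.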

Correct: <title>José Pérez wins grand prize</title>
Incorrect: <title>Jos&eacute; P&eacute;rez wins grand prize</title>

Note: Some applications may translate HTML entity codes in XML files to the appropriate character, but you cannot rely on this.
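The difference can be checked with any conforming XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name parse_title is made up for this illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET

def parse_title(xml_text):
    """Return the text of a <title> element, or None if the XML is not well-formed.

    parse_title is a hypothetical helper written for this example.
    """
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        # Raised, for example, when the parser meets an undefined entity.
        return None

# Raw Unicode characters: accepted.
raw = parse_title("<title>José Pérez wins grand prize</title>")

# HTML named entity with no DTD declaring it: rejected as an undefined entity.
named = parse_title("<title>Jos&eacute; P&eacute;rez wins grand prize</title>")

# Numeric character reference: accepted and decoded to the same character.
numeric = parse_title("<title>Jos&#233; P&#233;rez wins grand prize</title>")

print(raw)
print(named)
print(numeric)
```

With this parser, the first and third titles come back as the same string, while the title using &eacute; fails to parse.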

Creating a Unicode XML File

Declare XML Encoding

The first line should contain a UTF-8 encoding declaration (in case an application chooses Latin-1 as a default). Here's the UTF-8 declaration for an RSS file:

<?xml version="1.0" encoding="UTF-8"?>
<rss>
...
</rss>
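If the file is generated by a script rather than typed by hand, the same declaration can be written automatically. This is a rough sketch using Python's standard library; the file name feed.xml and the bare <rss>/<title> structure are placeholders for illustration, not a complete RSS feed.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Build a tiny RSS-like tree; the structure is for illustration only.
rss = ET.Element("rss")
title = ET.SubElement(rss, "title")
title.text = "José Pérez wins grand prize"

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "feed.xml")

# encoding="utf-8" together with xml_declaration=True writes the
# declaration line and stores accented characters as raw UTF-8 bytes.
ET.ElementTree(rss).write(path, encoding="utf-8", xml_declaration=True)

# Read the file back as bytes to inspect exactly what was written.
with open(path, "rb") as f:
    data = f.read()

print(data.decode("utf-8"))
```

Leaving out the encoding argument makes this library fall back to ASCII output, in which case accented characters are written as numeric character references instead of raw characters.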

Use Text Editors which Support UTF-8

Many older text editors support only ASCII or Latin-1 and cannot display Unicode characters. The result is that each Unicode character will "break up" into a sequence of unrelated characters (for example, é may appear as Ã©). See the list of recommended editors below.
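The "break up" effect comes from reading UTF-8 bytes as if each byte were one Latin-1 character. A short Python sketch reproduces it; the sample name is taken from the RSS example on this page.

```python
text = "José Pérez"

# UTF-8 stores each accented é as two bytes (0xC3 0xA9).
utf8_bytes = text.encode("utf-8")

# An editor that assumes Latin-1 shows every byte as a separate character.
misread = utf8_bytes.decode("latin-1")

print(len(text), len(utf8_bytes))  # 10 12
print(misread)                     # JosÃ© PÃ©rez
```

Opening the same bytes as UTF-8 restores the original text, which is why the editor's encoding setting matters more than the file itself.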

Recommended Text Editors

These text editors allow you to easily type encoded text and then export it as properly encoded HTML or text files.

Windows

  1. Notepad (free with Windows)
  2. StarOffice (Windows/Linux; freeware)
    • To save StarOffice documents as encoded documents, go to File, then Save As. Select the Text Encoded format. In the next window, select UTF-8 encoding.
  3. Other commercial text editors which support Unicode are also available

NOTE: Because of various technical issues, export from Microsoft Word is not recommended.

Macintosh

  1. Apple TextEdit (free with OS X)
  2. BBEdit (Available in CLC Labs)
    • To save a UTF-8 file, click Save As, then the Options button, then the Encoding button, and select UTF-8 in the menu. BBEdit may also prompt you to make the change if your file contains non-Latin-1 characters.
  3. Other text editors designed for foreign language text editing may be able to export encoded text or HTML files.

NOTE: Because of Microsoft HTML formatting issues, export from Microsoft Word is not recommended.

Inserting Unicode Characters

In all the above applications, you can use either keyboard or character insertion utilities to enter the data.

Placing XML Data in XHTML

If you plan to insert your XML data into an XHTML document (or transform it via XSLT), make sure your XHTML file includes the UTF-8 declaration:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
</head>

This will help ensure your XML Unicode characters are displayed properly even without using numeric entity codes.

Tuesday, 04-Jun-2013 12:41:35 EDT