Internationalised Domain Names - Bringing the World to the Web
Created | Updated Nov 11, 2011
One of the most fundamental parts of the design of the Internet is the domain name, or website address. Take, as an example, the web address, or URL, http://www.bbc.co.uk/dna/h2g2/. If you enter this address into your browser, it will separate the address into its components. First there's the scheme (http), then the domain name (www.bbc.co.uk), and then there's the specific resource (/dna/h2g2/). The domain name can in turn be broken down into a series of labels separated by the delimiter (the dot). Reading from the most general to the most specific, the labels are uk, co, bbc, and www.
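As a rough illustration of that breakdown, here is a minimal Python sketch using the standard library's urlsplit. The variable names are just for illustration, and a real browser does considerably more work than this.

```python
from urllib.parse import urlsplit

url = "http://www.bbc.co.uk/dna/h2g2/"
parts = urlsplit(url)

scheme = parts.scheme      # the scheme, "http"
domain = parts.hostname    # the domain name, "www.bbc.co.uk"
resource = parts.path      # the specific resource, "/dna/h2g2/"

# Split the domain name on the dot and list the labels
# from the most general (rightmost) to the most specific.
labels = domain.split(".")[::-1]
print(scheme, domain, resource, labels)
```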
The browser then looks up the domain name using the Domain Name System (DNS) servers to find the IP address, a series of numbers to identify the computer responsible for the domain. Then it sends a request to that computer for the resource (in our example, /dna/h2g2/). Simple enough.
The Problem
Many users of the Internet do not speak English as a first language. And, as Asia goes online at an ever-increasing rate, there is greater and greater desire to make the Web more multilingual. The problem is that the DNS protocol, which maps names to numbers, only understands ASCII. ASCII, the American Standard Code for Information Interchange, is a simple character encoding whose only letters are the 26 unaccented letters of the Latin alphabet, alongside the digits and some punctuation. There are no accented letters, such as á, å, ç, ê, ø, or ù, and no extra letters such as the Icelandic and Old English æ, ð, and þ or the German ß. And certainly there's no support for Arabic, Greek, Cyrillic, or Chinese characters.
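To make the limitation concrete, here is a small Python sketch that asks whether a piece of text can be written in ASCII at all. The helper name fits_in_ascii is made up for this example.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character of text can be encoded as ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True

# Plain Latin letters and digits are fine; accented or extra letters are not.
print(fits_in_ascii("h2g2"))        # an all-ASCII label
print(fits_in_ascii("caf\u00e9"))   # "café", with an accented e
print(fits_in_ascii("\u00fe"))      # the letter thorn
```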
So how can the Web be made easier for people who don't natively use the Latin character set? After much thought and debate, a solution was outlined in RFC 3490.
The Solution
Unicode is a far larger character set which includes a vast array of scripts. Each ASCII character maps to the Unicode code point with the same number, so ASCII is a subset of Unicode. The solution uses Unicode domain names.
Internationalised Domain Names are a client-side solution. In other words, the technical change is not in the DNS servers, but in individual browsers. When a domain name is entered into the browser's address bar, the browser should, before querying the DNS servers, perform the following operations. First, the name is split into labels, as usual. However, the normal full stop (dot) is no longer the only delimiter:
Whenever dots are used as label separators, the following characters MUST be recognised as dots: U+002E (full stop), U+3002 (ideographic full stop), U+FF0E (fullwidth full stop), U+FF61 (halfwidth ideographic full stop).
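This helps to reduce confusion: someone typing a name with, say, a Chinese or Japanese input method can use the full stop that their keyboard produces, and the name will still be split into the same labels.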
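The splitting rule is easy to sketch in Python. The function name split_labels is an invention for this example; the four code points are the ones quoted above.

```python
import re

# The four characters that must all be recognised as dots.
DOTS = "\u002e\u3002\uff0e\uff61"
_DOT_RE = re.compile("[" + DOTS + "]")

def split_labels(name: str) -> list:
    """Split a domain name into labels on any of the four dot characters."""
    return _DOT_RE.split(name)

print(split_labels("www.bbc.co.uk"))
print(split_labels("www\u3002example\uff0ecom"))  # ideographic and fullwidth dots
```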
Each label is then processed by a function called ToASCII. If the name already consists entirely of ASCII characters, the function ToASCII leaves it unchanged. (This means that applying the function multiple times has the same effect as applying it just once.) However, if there are Unicode characters beyond the ASCII range, this function will alter the label, creating what is known as an ACE (ASCII Compatible Encoding) label. The ACE label is a string of ASCII characters which represents the original text. It is ugly-looking, and should not be shown to the user (unless the user is a geek who specifically requests it). ACE labels begin with the ACE prefix, the four-character sequence xn--.
The ACE label is then used in the DNS query. Another function, called ToUnicode, is then called. This prepares the name which will be shown to the user.
The ToASCII operation is used before sending an IDN to something that expects ASCII names ....
The ToUnicode operation is used when displaying names to users ...
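Python's standard library happens to ship an "idna" codec that implements RFC 3490, so the round trip can be tried out directly. The wrapper names to_ace and to_display are assumptions made for this sketch, and bücher.example is just a sample name.

```python
def to_ace(name: str) -> str:
    """Apply ToASCII to each label: the form that goes into the DNS query."""
    return name.encode("idna").decode("ascii")

def to_display(ace_name: str) -> str:
    """Apply ToUnicode to each label: the form that is shown to the user."""
    return ace_name.encode("ascii").decode("idna")

ace = to_ace("b\u00fccher.example")   # "bücher.example"
print(ace)                            # the ACE form, starting with xn--
print(to_display(ace))                # back to the readable form
print(to_ace(ace) == ace)             # applying ToASCII again changes nothing
```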
Technical Details of the ToASCII function
The ToASCII function uses the nameprep function. Remember that domain names are case-insensitive? Computers know what that means for ASCII text, and can convert H2G2.COM to h2g2.com. Nameprep, defined in RFC 3491, outlines equivalent normalisations for other scripts. ToASCII also uses punycode, defined in RFC 3492. Punycode is built on Bootstring, a general method for representing a sequence of code points drawn from a large set using only a smaller set of basic code points. ASCII is a subset of Unicode, so Bootstring can be used to represent Unicode text in ASCII characters. The particular set of Bootstring parameters chosen for IDNs is called punycode. (Punycode includes algorithms for both encoding and decoding.)
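For a feel of what nameprep does, Python's encodings.idna module exposes a nameprep function that follows RFC 3491. The sample strings below are only illustrations.

```python
from encodings.idna import nameprep

# ASCII text is simply folded to lower case.
print(nameprep("H2G2"))

# The same idea applied beyond ASCII: "BÜCHER" folds to "bücher".
print(nameprep("B\u00dcCHER"))

# Some mappings are less obvious: under these rules the German sharp s
# in "Straße" is mapped to "ss".
print(nameprep("Stra\u00dfe"))
```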
A string of text encoded by punycode will contain first all the ASCII characters in the original string (if there were any), then a delimiter (the hyphen), and then another list of ASCII characters which encode both the non-ASCII code points and the positions in the original string at which those characters should be inserted. If the original string contains no ASCII characters at all, the delimiter is left out. Either way, the string xn-- is then prefixed to the punycode, which marks the label as an ACE label.
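Python also has a bare "punycode" codec, which makes the layout easy to see. Note that this codec performs only the RFC 3492 encoding: it does not run nameprep and does not add the xn-- prefix. The helper names are made up for this sketch.

```python
def punycode(label: str) -> str:
    """Encode one label with punycode (no nameprep, no ACE prefix)."""
    return label.encode("punycode").decode("ascii")

def unpunycode(encoded: str) -> str:
    """Decode a punycode string back to the original text."""
    return encoded.encode("ascii").decode("punycode")

# "bücher": the ASCII letters come first, then the hyphen,
# then the characters that encode the ü and where it belongs.
encoded = punycode("b\u00fccher")
print(encoded)
print(unpunycode(encoded))
```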
The steps of the ToASCII function, as set out in RFC 3490, are:
1. If the sequence contains any code points outside the ASCII range, proceed to step 2; otherwise skip to step 3.
2. Perform the steps specified in nameprep and fail if there is an error.
3. If the UseSTD3ASCIIRules flag is set, verify that the label contains no ASCII characters other than letters, digits, and the hyphen, and that it neither begins nor ends with a hyphen.
4. If the sequence contains any code points outside the ASCII range, proceed to step 5; otherwise skip to step 8.
5. Verify that the sequence does not begin with the ACE prefix.
6. Encode the sequence using the encoding algorithm in punycode and fail if there is an error.
7. Prepend the ACE prefix.
8. Verify that the number of code points is in the range 1 to 63 inclusive.
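The ToASCII operation described in this section can be sketched in a few lines of Python. This is a simplified illustration rather than a complete implementation: it leans on the standard library's nameprep and punycode codec, the name to_ascii is invented here, and the optional UseSTD3ASCIIRules check is left out.

```python
from encodings.idna import nameprep

ACE_PREFIX = "xn--"

def to_ascii(label: str) -> str:
    """Simplified ToASCII for one label; raises ValueError on failure."""
    # Run nameprep only when there is something beyond ASCII to normalise.
    # (nameprep reports problems as UnicodeError, a kind of ValueError.)
    if not label.isascii():
        label = nameprep(label)
    # The UseSTD3ASCIIRules check is skipped in this sketch.
    if not label.isascii():
        # A non-ASCII label must not already look like an ACE label.
        if label.startswith(ACE_PREFIX):
            raise ValueError("label already begins with the ACE prefix")
        # Punycode-encode the label and prepend the ACE prefix.
        label = ACE_PREFIX + label.encode("punycode").decode("ascii")
    # The finished label must be between 1 and 63 characters long.
    if not 1 <= len(label) <= 63:
        raise ValueError("label is empty or too long")
    return label

print(to_ascii("b\u00fccher"))   # "bücher" becomes an ACE label
print(to_ascii("www"))           # an all-ASCII label is left unchanged
```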
Technical Details of the ToUnicode Function
While the ToASCII function can fail, for example when nameprep rejects the input or when a non-ASCII label already begins with the ACE prefix, the ToUnicode function never fails. If any of the steps below reports an error, the original input is simply returned unaltered. So if gobbledygook is sent to the function, gobbledygook will be returned by the function. However, generally the ToUnicode function will be called on the output of the ToASCII function, so it's not dealing with gobbledygook.
1. If all code points in the sequence are in the ASCII range then skip to step 3.
2. Perform the steps specified in nameprep and fail if there is an error.
3. Verify that the sequence begins with the ACE prefix, and save a copy of the sequence.
4. Remove the ACE prefix.
5. Decode the sequence using the decoding algorithm in punycode and fail if there is an error. Save a copy of the result of this step.
6. Apply ToASCII.
7. Verify that the result of step 6 matches the saved copy from step 3, using a case-insensitive ASCII comparison.
8. Return the saved copy from step 5.
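The steps above can be sketched in Python as follows. Again this is an illustration under assumptions, not a reference implementation: the name to_unicode is invented, and the sketch borrows nameprep and ToASCII from the standard library's encodings.idna module. Any error along the way is turned into "return the input unaltered".

```python
from encodings.idna import ToASCII, nameprep

ACE_PREFIX = "xn--"

def to_unicode(label: str) -> str:
    """Simplified ToUnicode for one label; never raises."""
    original = label
    try:
        # Steps 1-2: nameprep only if the input goes beyond ASCII.
        if not label.isascii():
            label = nameprep(label)
        # Step 3: the label must begin with the ACE prefix; keep a copy.
        if not label.lower().startswith(ACE_PREFIX):
            return original
        saved = label
        # Steps 4-5: remove the prefix and punycode-decode the rest.
        decoded = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
        # Steps 6-7: re-encode and compare, ignoring ASCII case.
        if ToASCII(decoded).decode("ascii").lower() != saved.lower():
            return original
        # Step 8: the decoded text is the result.
        return decoded
    except UnicodeError:
        # Any failure means the input is handed back untouched.
        return original

print(to_unicode("xn--bcher-kva"))
print(to_unicode("gobbledygook"))
```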
The Next Problem
So, that's all well and good then. We can now use Arabic, Urdu or Cyrillic characters in domain names, and they'll show up correctly in all modern browsers, yes? No. Many browsers which implemented IDNs later turned them off again. Why?
One of the main reasons was the website www.xn--pypal-4ve.com. In case you're wondering, the code 4ve inserts the Cyrillic character а (U+0430) in the second position, meaning that the site name appears in the browser address bar as www.pаypal.com. The Cyrillic character а looks, in most fonts, identical to the Latin character a, which means that this site could spoof the famous commercial website www.paypal.com. This was the most prominent of many examples of the potential to use IDNs for phishing attacks.
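The trick is invisible on screen but obvious to a program. This short Python sketch decodes the spoofing name with the standard "idna" codec and asks Unicode what each character in the middle label really is. The helper name describe is made up for the example.

```python
import unicodedata

def describe(label: str) -> list:
    """Return the official Unicode name of each character in a label."""
    return [unicodedata.name(ch) for ch in label]

spoof = b"www.xn--pypal-4ve.com".decode("idna")
genuine = "www.paypal.com"

spoof_names = describe(spoof.split(".")[1])
genuine_names = describe(genuine.split(".")[1])

print(spoof == genuine)      # the two names are different strings
print(spoof_names[1])        # second character of the spoof label
print(genuine_names[1])      # second character of the genuine label
```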
It has always been possible to spoof domain names with similar looking characters (homographs). Previously, however, this was limited to ASCII characters, such as l and 1. With the advent of IDNs, there is far more potential for confusion.
The Next Solution?
This led to frenzied discussion among browser manufacturers. One obvious solution was to disallow IDNs which contained mixed scripts, such as Cyrillic and Latin. That's initially tempting, but won't work for Japanese, which is often written in three scripts at a time (kanji, hiragana, and katakana).
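A crude version of the mixed-script test shows both its appeal and its flaw. Python's standard library has no direct "which script is this?" lookup, so this sketch guesses the script from the first word of each character's Unicode name. That is a rough heuristic adopted purely for illustration, not how any browser actually does it, and the function names are invented.

```python
import unicodedata

def scripts_in(label: str) -> set:
    """Guess the scripts used in a label from Unicode character names."""
    found = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found

def is_mixed_script(label: str) -> bool:
    return len(scripts_in(label)) > 1

# The spoof is caught: Latin letters plus one Cyrillic letter.
print(scripts_in("p\u0430ypal"))

# But so is ordinary Japanese ("日本語のドメイン"), which mixes
# kanji, hiragana, and katakana quite legitimately.
japanese = "\u65e5\u672c\u8a9e\u306e\u30c9\u30e1\u30a4\u30f3"
print(scripts_in(japanese))
```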
Safari decided to disable Cherokee, Cyrillic, and Greek scripts by default. These three scripts all have many Latin lookalike characters. This is well and good if you have a majority English-speaking audience, but not ideal for the rest of the world. And it certainly doesn't solve all the problems. Latin characters with unusual diacritics will still work, and some fonts leave out some of these diacritics, so that l and l with cedilla (ļ) may well look identical.
Opera, and later Firefox, took the position that this was a registry issue: Internet registries should not allocate domains such as www.xn--pypal-4ve.com in the first place. These browser manufacturers maintain a whitelist of 'well behaved' registries, which don't allow spoofing domains to be registered. For example, it would be impossible to register www.xn--pypal-4ve.info, because the .info registry wouldn't allow it. Opera's whitelist of top-level domains (TLDs) is built into the browser. Firefox's list is published online. It's worth noting that the registry for the most popular TLD, .com, allocates domains strictly on a first-come-first-served basis, with no checks at all. Therefore, IDNs in .com will not be displayed in their Unicode form in Opera or Firefox.
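The whitelist approach boils down to a simple display rule, sketched here in Python. The whitelist contents are purely illustrative (just the .info example from the text) and are not the real list of any browser; the function name display_form is also an assumption.

```python
# Illustrative only: NOT the actual whitelist of Opera or Firefox.
TRUSTED_TLDS = frozenset({"info"})

def display_form(ace_host: str, trusted=TRUSTED_TLDS) -> str:
    """Show the Unicode form only for hosts under a trusted TLD."""
    tld = ace_host.rstrip(".").rsplit(".", 1)[-1].lower()
    if tld not in trusted:
        return ace_host                      # leave the ACE form visible
    try:
        return ace_host.encode("ascii").decode("idna")
    except UnicodeError:
        return ace_host                      # undecodable: show it as-is

print(display_form("www.xn--bcher-kva.info"))   # trusted TLD: decoded
print(display_form("www.xn--pypal-4ve.com"))    # untrusted TLD: left as ACE
```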
Google's new Chrome browser has IDN display turned off by default.
Internet Explorer is, of course, more integrated with the operating system than most browsers are, so it checks which languages are configured within Windows. If the script in the IDN belongs to one of the user's configured accept languages, the name will be displayed in its Unicode form. Otherwise, the punycode will be shown. However, when scripts are mixed (Cyrillic and Latin in the same label, for example), the punycode will be shown, even if Cyrillic (in our example) is usually accepted. Some scripts which look nothing like Latin are allowed to mix, as these don't present a threat.