Home / Blogs

Email in the World's Languages - Part II

Don't miss a thing – sign up for CircleID Weekly Wrap newsletter delivered to your inbox once a week.
John Levine

In our last installment we discussed MIME, Unicode and UTF-8, and IDNA, three things that have brought the Internet and e-mail out of the ASCII and English only era and closer to fully handling all languages. Today we'll look at the surprisingly difficult problems involved in fixing the last bit, internationalized e-mail addresses.

All of the extensions described up to this point have been backward compatible with the existing mail delivery infrastructure. MIME encodes all of the extended characters and files as ASCII text, so although you need a MIME-aware mail program on your PC (or web mail or whatever), the mail systems through which the mail passes don't have to know anything about MIME. Similarly, non-ASCII IDNA domains are encoded as funny looking ASCII A-labels starting with xn-- so, again, although applications have to know how to turn UTF-8 into A-labels, the underlying DNS software just sees the ASCII.

Although SMTP changes very slowly, it does have a process to add new optional features, with a way for clients and servers to tell each other which options they support and are using. In 1993, a new feature called 8BITMIME added a simple but surprisingly useful option to send mail with 8-bit characters. It didn't change any of the other rules, but it did have the effect of allowing any character set that uses the ASCII codes for carriage return, line feed, and period, (as do nearly all but EBCDIC), in mail message bodies. On computers using eight-bit bytes internally (all of them, these days) the code to support 8BITMIME is simple enough that all popular mail systems support it. So if we want to send eight-bit character code in messages, SMTP can handle it.

The main remaining bit of international e-mail that SMTP can't yet handle is the actual e-mail addresses, both the ones in the message headers such as To: and From: and the ones in the SMTP session. If a mail system is going to be fully international, it also needs to allow non-ASCII text in prompts, error messages, and anywhere else that strings are passed around for presentation to users. This turns out to affect pretty much every piece of software in the e-mail ecosystem, the MUAs (user programs), submission agents that inject mail from user programs, MTAs that move mail from one site to another, and POP and IMAP servers that let MUAs pick up incoming mail. The changes to POP and IMAP and user programs to be language independent are relatively straightforward so I'm going to concentrate on the tricky parts: the format of mail messages, and the SMTP delivery process.

For quite a while people attempted to invent ways to do non-ASCII addresses that were more or less backward compatible with legacy mail. Early work started in China and Japan, where ASCII-only mail was and is particularly unsatisfactory, and initially used simple approaches like allowing various eight-bit extended character sets in mail headers and addresses. Those character sets, such as ISO-2022-x, use sequences of control characters to switch among various "pages" of the character set, so that the meaning of any particular character depends on what shifts preceded it. These didn't work well for a variety of reasons: the shifts mean that a string's meaning depends on what shift state it's in, something that was often implicit or ambiguous, the same set of characters can be represented in many different ways by adding or deleting shifts, and the various experiments tended not to pay sufficient attention to what happened if mail was sent to a mail system that didn't have their extensions and didn't understand their extended character set.

UTF-8 doesn't have the shift problem, but it still uses characters outside the ASCII set that won't work with legacy mail systems. The next approach was to allow UTF-8 addresses, but encode them as ASCII. For domains, that's already done with A-labels and punycode; every non-ASCII domain has an ASCII version that starts with xn--. Unfortunately, that doesn't work with mailboxes (the part of the address before the @.) For one thing, any special characters one might use to mark encoded UTF-8 names have already been used somewhere to mean something else. Hyphens and plus signs are used to mark sub-addresses, exclamation points for uucp addresses, percent signs for source routing, slashes for X.400 compatible addresses, equal signs for VERP bounce addresses, and so on. Even if you could find a sequence of marker characters so arcane that nobody had ever used it, mailboxes are limited to 64 ASCII characters, and by the time you encoded UTF-8 as something along the lines of punycode, the length of the UTF-8 address would be limited to about 18 characters, which is rather short, particularly since people with non-ASCII addresses will want to use all of the same tricks that we ASCII users use with sub-addresses and other address extensions. So that doesn't work, either.

The most completely worked out experiment, documented in RFCs 5335 and 5336 added a new SMTP option called UTF8SMTP, in which email addresses can be almost arbitrary UTF-8, subject to the 64 byte limit on the mailbox, and the domain has to be one that can be turned into A-labels. Messages are sent using 8BITMIME, and UTF-8 can occur nearly anywhere in the message that ASCII can. Everyone with a UTF-8 address can also have an ASCII address, and one can address mail to both of them, with syntax like this:

To: <utf8mailbox@utf8domain <asciimailbox@asciidomain>>

When sending mail to a UTF8SMTP server, a client sends mail to the UTF-8 address, but provides the ASCII address as well. If the mail has to go to a server that does not handle UTF8SMTP, a complex set of downgrade rules turns any headers with UTF-8 into Downgraded-whatever: headers with MIME encoded versions of the UTF-8, replaces any UTF-8 addresses in the To: and From: lines with comments, and sends it along to the ASCII address.

While it was a magnificent piece of kludgery, after a while it became clear that this dual address system didn't work either. One reason was that it was very hard to keep the UTF-8 and ASCII addresses straight, so UTF-8 mail tended to leak into ordinary SMTP traffic, and downgraded mail into UTF-8 traffic. Furthermore, even though the dual addresses were intended as a temporary transition feature, on the Internet, no transition feature ever goes away, and Japanese and Chinese users quite reasonably did not want a system that would require that they have an ASCII address forever.

So the EAI working group threw away most of the kludges and is nearly done defining the permanent internationalized mail system for the Internet. In our next installment, we'll see how your future mail will work.

By John Levine, Author, Consultant & Speaker. More blog posts from John Levine can also be read here.

Related topics: DNS, Email, Multilinguism



To post comments, please login or create an account.

Related Blogs

Related News

Explore Topics

Dig Deeper


DNS Security

Sponsored by Afilias
Afilias Mobile & Web Services

Mobile Internet

Sponsored by Afilias Mobile & Web Services


Sponsored by Verisign

Promoted Posts

Now Is the Time for .eco

.eco launches globally at 16:00 UTC on April 25, 2017, when domains will be available on a first-come, first-serve basis. .eco is for businesses, non-profits and people committed to positive change for the planet. See list of registrars offering .eco more»

Boston Ivy Gets Competitive With Its TLDs, Offers Registrars New Wholesale Pricing

With a mission to make its top-level domains available to the broadest market possible, Boston Ivy has permanently reduced its registration, renewal and transfer prices for .Broker, .Forex, .Markets and .Trading. more»

Industry Updates – Sponsored Posts

Global Domain Name Registrations Reach 329.3 Million, 2.3 Million Growth in Last Quarter of 2016

Neustar to be Acquired by Private Investment Group Led by Golden Gate Capital

Government Guidance for Email Authentication Has Arrived in USA and UK

ValiMail Raises $12M for Its Email Authentication Service

Don't Gamble With Your DNS

Verisign Releases Q2 2016 DDoS Trends Report - Layer 7 DDoS Attacks a Growing Trend

How Savvy DDoS Attackers Are Using DNSSEC Against Us

Radix Adds Dyn as a DNS Service Provider

Port25 Announces Release of PowerMTA V4.5r5

Dyn Partners with the Internet Systems Consortium to Host Global F-Root Nameservers

Verisign Announces .コム Domain Names Are Now Available for Anyone to Register

New Case Study: Jobtome.com Replaces 30 Postfix Servers with a Single PowerMTA

Is Your TLD Threat Mitigation Strategy up to Scratch?

Verisign Launches New gTLDs for the Korean Market, .닷컴 and .닷넷

Verisign Opens Landrush Program Period for .コム Domain Names

Domain Management Handbook from MarkMonitor

An Update on Port25 and the Future of PowerMTA - One Year Later​

Encrypting Inbound and Outbound Email Connections with PowerMTA

What Holds Firms Back from Choosing Cloud-Based External DNS?

V12 Group Sustains Customer Satisfaction by Deploying PowerMTA for Launchpad Platform