Choosing Internationalized Email Addresses

John Levine

Recently I've been working on Email Address Internationalization (EAI), looking at what software is available (Gmail and Outlook/Hotmail both handle it now) and what work remains to be done. A surprisingly tricky part is assigning EAI addresses to users.

In traditional ASCII mail, the local part of the address, what goes before the @ sign, can be any printable ASCII characters. Although an address like %i()/;~f@examp1e.com is valid, and mail systems will handle it, users don't want addresses like that. A good address is one that is easy to remember, easy to tell someone over the phone, and easy to type.

Mail systems all give senders some help when interpreting addresses. If an address is Bob@example, they'll accept bob@ or BOB@. If the address is joe.smith@, they'll accept Joe.Smith@ and often variations in punctuation like joesmith@ without the dots.

The flip side of this is that you don't assign different addresses that are too similar. While it is technically possible that BOB@ and bob@ could deliver to different mailboxes, nobody does that. Similarly, nobody makes joesmith@ and joe.smith@ different. (They may not both work, but if they do, they're the same mailbox.)
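
A minimal sketch of the ASCII convention described above, assuming a mail system that treats case and dots as insignificant. The names ascii_key and Mailboxes are hypothetical, invented for this illustration, and real systems differ in which variations they ignore.

```python
def ascii_key(local_part):
    """Collapse case and dots so bob/BOB and joe.smith/joesmith compare equal."""
    return local_part.lower().replace(".", "")


class Mailboxes:
    """Hypothetical registry: assign conservatively, accept liberally."""

    def __init__(self):
        self._by_key = {}

    def assign(self, local_part):
        # Refuse a new address that is too similar to an existing one.
        key = ascii_key(local_part)
        if key in self._by_key:
            return False
        self._by_key[key] = local_part
        return True

    def resolve(self, local_part):
        # Any variant in case or dots finds the same mailbox, or None.
        return self._by_key.get(ascii_key(local_part))
```

With this, assigning joe.smith succeeds, a later attempt to assign JoeSmith is refused, and incoming mail to Joe.Smith@ or joesmith@ resolves to the one mailbox.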

The domain (the part of the address after the @ sign) has to follow the DNS rules, which don't allow any fuzzy matching other than ASCII upper and lower case.

How does all this extend into EAI mail?

EAI extends ASCII addresses in a straightforward way: in addition to any printable ASCII characters, the local part can contain any printable UTF-8 characters, and the domain can be UTF-8 U-labels. As before, users will have an easier time if mail systems assign addresses conservatively and interpret addresses on incoming mail liberally.
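
To make the two halves of an EAI address concrete, here is a rough sketch that splits an address at the last @ sign and converts a U-label domain to its ASCII A-label form. The function names are hypothetical. It uses the "idna" codec from Python's standard library, which implements the older IDNA 2003 rules, so treat it as an approximation of what a production system would do.

```python
def split_address(address):
    """Split at the last @ into (local part, domain)."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not a mailbox address: %r" % address)
    return local, domain


def domain_to_a_labels(domain):
    """Convert U-labels to A-labels; the local part is never converted."""
    # The stdlib codec leaves all-ASCII labels as typed, so lowercase
    # the result to get the usual DNS case-insensitive form.
    return domain.encode("idna").decode("ascii").lower()
```

The local part stays UTF-8 and is interpreted only by the receiving mail system, while the domain has a well-defined ASCII equivalent that the DNS can look up.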

The PRECIS working group at the IETF defined string classes for different applications. The Identifier class works well for mailbox names: it contains the codepoints that are (roughly) letters and digits in various languages.

It also provided rules to prepare UTF-8 strings for use. Unicode often provides multiple ways to represent exactly the same character, e.g., a single codepoint for an accented character é or separate e and accent codepoints. It often also has variant characters that look different but mean approximately or exactly the same thing, such as full-width and half-width versions of characters, Latin digits 12345 and Arabic digits ١٢٣٤٥, or traditional and simplified Chinese characters. To prepare a string, software maps variant codepoints into preferred ones, usually precomposed characters such as é. Mail systems should assign mailbox names in prepared form, but they can and should accept addresses in the incoming mail in any form since they can prepare them as they receive them. (This is different from the DNS where DNS servers only do exact matches, so the client has to do any preparation.)
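
The preparation step can be sketched with Python's standard unicodedata module. This is loosely modeled on the kind of rules PRECIS describes (width mapping, case mapping, then composing characters), not a complete or conforming PRECIS implementation, and the function names are made up for this example.

```python
import unicodedata


def width_map(ch):
    """Map a full-width or half-width codepoint to its ordinary form."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<wide>") or decomp.startswith("<narrow>"):
        # The decomposition looks like "<wide> 0042"; keep the codepoints.
        return "".join(chr(int(cp, 16)) for cp in decomp.split()[1:])
    return ch


def prepare_local_part(local_part):
    """Approximate preparation: width map, lowercase, compose (NFC)."""
    mapped = "".join(width_map(ch) for ch in local_part)
    return unicodedata.normalize("NFC", mapped.lower())
```

A mail system would store mailbox names in the output form and run the same function over each incoming address before looking it up, so an e followed by a separate accent codepoint and a precomposed é both find the same mailbox.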

There's no reason that a mail system's fuzzy matching has to stop where PRECIS and ASCII addresses did. The Latin and Arabic digits aren't the same for PRECIS, but it's easy enough for a mail system to map them together and to ensure that it doesn't issue two mailboxes with digits that collide. In languages written in the Latin script with accented or multiple forms of characters (such as the Turkish dotless ı), a conservative mail system would avoid assigning addresses that differ only in the form of a letter, and would accept all versions of the letter, even ones that aren't valid or equivalent in the user's language. For example, even though Turkish speakers wouldn't write i for ı, correspondents who don't speak Turkish might, and it's easier all around if the slightly misspelled address works. Similarly, in Scandinavian languages the letters O Ø Ö are different, but it'd be a good idea to accept the wrong versions in incoming addresses.
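
One way such fuzzy matching might look, as a sketch only: fold digits from any script to ASCII digits, drop accents, and map a few letters that have no Unicode decomposition. The EXTRA_FOLDS table and the function names are assumptions made up for this example, and stripping all combining marks is only reasonable for Latin-script names; other scripts would need their own rules.

```python
import unicodedata

# Illustrative folds for letters with no Unicode decomposition.
# This table is an assumption for the sketch, not a standard list.
EXTRA_FOLDS = {"ı": "i", "ø": "o"}


def fuzzy_key(local_part):
    """Build a comparison key that ignores accents and digit script."""
    out = []
    for ch in unicodedata.normalize("NFKD", local_part.lower()):
        if unicodedata.combining(ch):
            continue  # drop the accent: ö becomes o, ş becomes s
        if ch.isdecimal():
            ch = str(unicodedata.decimal(ch))  # Arabic ١ becomes 1
        out.append(EXTRA_FOLDS.get(ch, ch))
    return "".join(out)


def collides(candidate, existing):
    """True if candidate is too similar to an already assigned name."""
    key = fuzzy_key(candidate)
    return any(fuzzy_key(name) == key for name in existing)
```

The same key serves both purposes: checking collides before issuing a new mailbox keeps assignments conservative, and looking up incoming addresses by fuzzy_key lets isik@ reach the mailbox assigned as ışık@.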

Mail systems have only recently started to assign EAI addresses, and I'm not yet aware of any of them doing fuzzy matching on incoming addresses. But for the same reason we have found it a good idea to allow jimsmith@ for jim.smith@ in ASCII mail, EAI mail systems will have to figure out how to adapt to however their correspondents type the EAI addresses.

By John Levine, Author, Consultant & Speaker