Putting String Similarity into Context: Bulgaria’s IDN (.бг) vs. Brazil’s ccTLD (.br)

Bulgaria is directly affected by the current Fast Track rule that automatically disqualifies a Top-Level Domain (TLD) string when it is “confusingly similar” to another TLD, in this case an Internationalized Domain Name (IDN) country code Top-Level Domain (ccTLD). Bulgaria’s application to register the .бг Cyrillic IDN has already been declined twice (in late 2009 and in May 2010) on the premise that it looks confusingly similar to Brazil’s .br ASCII TLD.

Being a native Bulgarian, I did not see how these two strings are similar, or confusing for that matter, so some research into how ICANN determines a confusingly similar string was due. While reviewing the ICANN rules, it hit me that a very important part of the comparison had been left out, namely how these strings will be used.

Before going into this, let me start with a few words on where the problem lies, i.e. why ICANN finds these strings to be confusingly similar.

[Figure: Similarities and differences between Cyrillic and Latin characters]

[Figure: The world population that speaks Cyrillic languages]

Although someone who reads only Latin-script languages can certainly find these strings quite similar, a Cyrillic-speaking person will know which one is which. The Cyrillic letter б does not look like a b (see my comparison of the Latin and Cyrillic alphabets); it actually looks much more like the number 6. However, every person who speaks a Cyrillic language will recognize the difference between the two letters and the number, especially when they are put into context (again, more on this later).

The difference between the second letters of the two top-level domains, г and r, is not as noticeable in regular fonts. Still, a person who knows a Cyrillic language will know the difference, and it becomes very noticeable in handwritten and italic fonts:

.бг vs .br

As a result, it seems that the people who find these strings confusingly similar are those who use Latin-script languages. The ICANN rules for string similarity reflect their view, without taking into consideration the population that speaks Cyrillic languages.

[Figure: The population that speaks Latin languages]

A major point that ICANN is missing in its current evaluation criteria for confusingly similar strings is that the criteria do not review TLDs, especially IDNs, in the context in which they will be used. When an IDN is reviewed in context, evaluating the string (and its alphabetical differentiation) becomes much clearer and easier. As an example, let’s look at how a company’s domain would look in Latin and as a Cyrillic IDN:

company.br
компания.бг (компания (BG) = company (ENG))

I doubt that someone will mistakenly take one for the other. Still, let’s analyze this in more detail and review some extreme similarity cases.

Brazil’s ccTLD vs. Bulgaria’s IDN

The main reasons that Brazil’s ccTLD differs from the Bulgarian IDN are:

  • A URL consists of a top-level domain and a second-level domain. Since .бг and .br are just top-level domains, they are meaningless without a second-level domain. When comparing full URLs, the difference between the two is exceptionally obvious. Example: company.br and компания.бг
  • Brazil uses three-tier domains (host+gTLD+ccTLD), whereas Bulgaria uses two-tier domains (host+ccTLD), which makes the visual gap between the two even larger. As a result, a Brazilian user looking at a Bulgarian URL will know right away that this is not a Brazilian domain, even if the host uses the same letters. Example 1: Vivo is one of Brazil’s mobile network operators. Their site is vivo.com.br, which in Bulgarian would be виво.бг. There is no resemblance between the two. Example 2: An imaginary company called American Electric has registered ae.com as its main domain. Its Bulgarian domain would be ае.бг, which does not resemble its Brazilian counterpart ae.com.br, even though the host is exactly the same. Even if Bulgaria starts using three-tier domain names (host+gTLD+ccTLD), this URL will look like ае.ком.бг, which is also decidedly not the same as the Brazilian domain.

The Extreme Case of string similarity

IMPORTANT NOTE: The analysis below is deliberately extreme, because such a case could happen. It underlines the importance of having regulations in place should such situations arise in the future. This analysis presumes that Brazil uses two-tier domain names (host+ccTLD), and that there is a company with a domain string that looks exactly the same in Cyrillic and Latin scripts.

  • If a non-native English speaker (such as a Frenchman or a Spaniard) sees ae.бг, but knows the context where the URL is used/mentioned, s/he will most probably know that this is a Cyrillic/Bulgarian domain. No action here.
  • If a non-native English speaker (such as a Frenchman or a Spaniard) or a native English speaker sees ae.бг without knowing the context where the domain is used/mentioned, s/he may think that it is in Latin.

In such cases, regulation (which is ICANN’s strength) should be in place to control the use of these strings and to ensure that a single registrant owns visually similar domains. In addition, browser vendors need to update their error messages for the case where ae.бг is entered in Latin letters in the browser and there is no such domain. The error message should indicate that the domain may be in Cyrillic. Here is an example of a possible error message:

Server not found
Firefox can’t find the server at www.ae.br.

  • Check the address for typing errors such as ww.example.com instead of www.example.com
  • Check whether the address should be in Cyrillic, such as ае.бг instead of ae.br
  • If you are unable to load any pages, check your computer’s network connection.
  • If your computer or network is protected by a firewall or proxy, make sure that Firefox is permitted to access the Web.
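
The registry-side rule suggested above, where a single registrant owns visually similar domains, presupposes some way to detect that a Cyrillic label can pass for a Latin one. Below is a minimal Python sketch of that idea. The `LOOKALIKES` table and the function name `latin_lookalike` are hypothetical and hand-picked for illustration; a real registry would presumably rely on a vetted confusables list rather than this short map.

```python
from typing import Optional

# Hypothetical, hand-picked map of Cyrillic lowercase letters to the Latin
# letters they resemble in common fonts. This is NOT an official list.
LOOKALIKES = {
    "\u0430": "a",  # а
    "\u0435": "e",  # е
    "\u043e": "o",  # о
    "\u0440": "p",  # р
    "\u0441": "c",  # с
    "\u0443": "y",  # у
    "\u0445": "x",  # х
}


def latin_lookalike(label: str) -> Optional[str]:
    """Return the Latin string a label could pass for, or None.

    None is returned as soon as one character has no entry in the table,
    because the label as a whole then cannot be mistaken for a Latin one.
    """
    out = []
    for ch in label:
        if ch not in LOOKALIKES:
            return None
        out.append(LOOKALIKES[ch])
    return "".join(out)


# "\u0430\u0435" is the Cyrillic host "ае" from the example above.
for label in ["\u0430\u0435", "компания", "бг"]:
    print(label, "->", latin_lookalike(label))
```

With this table, only the host ае maps to a Latin string (ae), so only that label would need to be tied to the owner of its Latin twin. Whether б and г belong in such a table is exactly the question this article disputes; they are left out here to mirror the position argued above.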

ICANN Staff’s reasoning on *.бг

ICANN staff’s reasoning for declining Bulgaria is that “internet is a world resource and uniqueness is most important.” However, the decision will have an impact on at least 7 million Bulgarians, not to mention their relatives and the Bulgarian-speaking population around the world. In addition, with the IDN ccTLD Fast Track Process ICANN wants to open the Internet to languages based on scripts other than Latin in order to make it more accessible, but at the same time it imposes limitations on that openness, thus effectively contradicting itself.

The good news is that ICANN is open to feedback (I have already submitted these comments to ICANN), so hopefully these findings will make it into the ccTLD application and Fast Track review later this year. I would nevertheless appreciate your thoughts on this, so please leave a comment.

The history of the Cyrillic alphabet

To finish off, I would like to give you a little background on the Cyrillic alphabet.

The Cyrillic script is an alphabet developed in the 9th century by two brothers, Cyril and Methodius, who were later on venerated in the Eastern Orthodox Church as saints. The Cyrillic alphabet was first adopted by Bulgaria, my home country, and because of that Cyrillic is believed to be a Bulgarian alphabet, although this is debatable. The Cyrillic script is used in the Slavic nations of Belarus, Bosnia, Bulgaria, Russia, Serbia, Macedonia, Montenegro, and Ukraine, and in the non-Slavic nations of Moldova, Kazakhstan, Uzbekistan, Kyrgyzstan, Tajikistan, Tuva, and Mongolia. With the accession of Bulgaria to the European Union on 1 January 2007, Cyrillic became the third official alphabet of the European Union, following the Latin and Greek alphabets. It is also one of the few alphabets that has its own holiday (May 24th), which is celebrated internationally.

By Vassil Petev, Unit Manager at Telerik

Comments

Cultural meaning Kevin Murphy  –  Aug 11, 2010 4:41 PM

Good stuff. Some interesting points raised.

Are there any cultural reasons why a second-choice TLD, such as .???, would be an unacceptable alternative?

I can imagine UK users being upset if we had to settle for something culturally meaningless like “.ukm” because .uk looked too much like an existing ccTLD. But I think it would be far less objectionable if we were given .gbr instead of .gb.

Is бг used in Bulgaria in a similar way? Do any of the alternatives have meaning?

Proper name for Bulgaria? Paul Hoffman  –  Aug 11, 2010 5:48 PM

Is бг the proper name for Bulgaria, or just an abbreviation? If the latter, is it so common that it should be used as the TLD instead of the proper name?

According to Wikipedia (an admittedly flawed source), the proper name of Bulgaria is България. That seems like a better TLD, at least to my non-Bulgarian eyes.

I think it's pretty obviously not the Kevin Murphy  –  Aug 11, 2010 11:17 PM

I think it's pretty obviously not the full proper name for Bulgaria.

бг is the abbreviation Vassil Petev  –  Aug 12, 2010 10:55 AM

бг is the abbreviation, just as UK stands for United Kingdom. I believe that using the full country name as the TLD will look awkward. A few examples: mycompany.bulgaria vs. mycompany.bg, and mycompany.co.unitedkingdom vs. mycompany.co.uk.

New abbreviations? Paul Hoffman  –  Aug 12, 2010 2:40 PM

The new TLD equivalents should be full names, not abbreviations. How many abbreviations should a country get? How many per script? Many languages don't have proper abbreviations, so allowing them puts countries with those languages at a distinct disadvantage. I disagree that the full name of Bulgaria "will look awkward"; you are fortunate to have a name shorter than "United States of America". We want to get away from the silly two-letter codes that were picked when space was limited and non-ASCII encodings were non-universal; going to abbreviations again seems counter-productive.

Agreeing with the logic of the author, but also notice .gr and others have homonym similarity Jothan Frakes  –  Aug 11, 2010 10:03 PM

For context, I looked at a few strings next to Bulgaria’s and noticed some slight similarities, but we seem to work past these in the western-iso character uses.

бг gr br gt bt ht hr

In reading what seemed to me like a very well articulated description by Vassil of the differences, I also realized that there are two entirely different keyboards and languages that would be used by the average web user that might visit any of these websites.

I’d see it more likely to confuse company.hr with company.br visually.  These are possible to compose without switching entry systems when typing a location.  ‘H’ and ‘B’ are reasonably close keyboard neighbors on a US ‘QWERTY’ type keyboard.

On the other hand, it seems rather impossible to typo the two strings br and бг.  You have to, on most keyboards, deliberately toggle between character composition in western and native character sets.

If one were composing in Cyrillic characters for the TLD, they are 99.999% likely to be composing the entire string in Cyrillic.  If the user is composing in English or Portuguese, an entirely different string composition would be happening.

I find it extremely unlikely that someone would deliberately cross-compose these character sets.  Most competent browsers expose the Punycode (xn--blahblahblah) when strings cross code pages like this, so it is not as though the end user will be tricked in some phishing or otherwise nefarious manner.
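
To make the Punycode point concrete, here is a small Python sketch using the standard library's `idna` codec (which implements the older IDNA 2003 rules). It shows the ASCII form that software can fall back to for a Cyrillic name. The domain names are the examples from the article; the helper names `to_ascii` and `to_unicode` are made up for illustration.

```python
def to_ascii(domain: str) -> str:
    """Encode each label of a domain name to its ASCII (Punycode) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Decode an ASCII (Punycode) domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


# A pure-ASCII name is left unchanged; every Cyrillic label gets the
# "xn--" prefix, which is impossible to mistake for a Latin label.
for name in ["company.br", "компания.бг"]:
    print(name, "->", to_ascii(name))
```

On the wire, the Cyrillic TLD бг on its own is `xn--90ae`, which has nothing in common with `br`.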

It seems to me very likely that if .бг constrained its registrations to Cyrillic pages, and identified those code points to application developers, this should create separate walled areas for each rich culture to experience the internet in their native language.
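
A Cyrillic-only registration rule of this kind could be checked mechanically. The following Python sketch is one possible reading of it, under the assumption that ASCII digits and hyphens stay allowed next to Cyrillic letters; `is_cyrillic_label` is a hypothetical helper, not an actual registry policy.

```python
import unicodedata

# Assumption: digits and the hyphen remain legal alongside Cyrillic letters.
ALLOWED_ASCII = set("0123456789-")


def is_cyrillic_label(label: str) -> bool:
    """True if every character is a Cyrillic character, a digit, or a hyphen."""
    if not label:
        return False
    for ch in label:
        if ch in ALLOWED_ASCII:
            continue
        # Unicode character names for Cyrillic characters start with "CYRILLIC".
        if not unicodedata.name(ch, "").startswith("CYRILLIC"):
            return False
    return True


# The last label mixes Latin letters with one Cyrillic "а" (U+0430).
for label in ["компания", "paypal", "p\u0430ypal"]:
    print(repr(label), is_cyrillic_label(label))
```

Under such a rule both the all-Latin label and the mixed-script label would be refused, so only fully Cyrillic names could appear under the new TLD.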

For whatever my input is worth, I think this is worth a second look before declining .бг IF it is possible to perhaps recommend a constrained character set.

Excellent points Vassil Petev  –  Aug 12, 2010 10:59 AM

Excellent points! Thank you for your comment and for standing up for this clause!

Isn't the concern phishing links rather than Kevin Murphy  –  Aug 12, 2010 11:02 AM

Isn't the concern phishing links rather than typos?

I think the thing to note is Kim Davies  –  Aug 11, 2010 10:15 PM

I think the thing to note is that under current ccTLD policy, there is no mandatory coordination between the operators of “.br” and “.бг” to guard against conflicts, and ICANN is not in a position to enforce registration rules in a specific country due to the way registration policy is delegated to countries in ccTLD-space. Therefore, while honest registrants may have no need to register potentially confusing names across the two namespaces, it is not honest registrants that confusingly similar guards are designed to protect against. Let’s assume “??” and “ab” are two country-code domains. Who is to say .?? won’t have a policy that permits com.?? to be registered by someone who wishes to use ae.com.?? for confusing similarity to ae.com.ab?

No hat, just trying to focus the problem space.

letting the legacy ccTLDs remain intact and complementing them with IDN ccTLD Jothan Frakes  –  Aug 11, 2010 11:59 PM

The ISO 3166 ccTLDs are there and exist, many with legacy relationships that are what they are with RFC1591 etc.  For all intents and purposes, these just all simply exist and are what they are.

I wouldn’t suggest retrofitting .BR or any existing ccTLD with anything, but because there are new TLDs being created, these could have new self-imposed rules.

The newcomer could find a way to make it work within the existing framework of strings.  Better put, they’d be motivated to do so because it would help their likelihood of the application moving forward.  In this case it would be submitting the codepages and characters that would be legal within that new IDN ccTLD.

So hypothetically, if an applicant was declined in an initial IDN ccTLD application because of a conflict, and saw a means to reduce or eliminate said conflict in a reasonable manner, would it stand to reason that the initial basis for the application’s denial might be reconsidered?

Illegal use Volker Greimann  –  Aug 17, 2010 12:25 PM

The author is correct in assuming ICANN should review the terms “confusingly similar” based on the way it will be used, to be more precise, in the way it may be abused.

Having a new IDN TLD look similar to a ccTLD poses a security risk for phishing. Even today similar domain names are used to defraud customers and make them think they actually reached the right address. Many users will only look at the address once and if it is similar to what they expect, they would not think twice about entering their payment details into, say: paypal.бг instead of paypal.com.br.

Saying that .br currently only allows registrations in the third level just confuses the issue. First, the policy may be changed tomorrow (see the liberalization of .co), rendering your point moot. Second, many people cannot be expected to know at which levels domains are open for registration in which TLD. Many TLDs offer both second and third level domain names (say .com.fr and .fr).

I believe .бг is confusingly similar to .br. Maybe not for those fluent in the Cyrillic alphabet, but for those not fluent in it, it will be, as it looks very similar at first glance, and the first glance is the only review many domain names get from the user. Considering the Asian region, where many users only learn western characters as a second alphabet, there is even more potential for confusion.

It is not reasonable for ICANN to Michael Dillon  –  Aug 18, 2010 9:48 AM

It is not reasonable for ICANN to deny .бг to Bulgaria. Instead they should issue this TLD with some simple conditions: .бг will not allow the following 4 domain names to be registered: sot.bg, som.bg, iet.bg and pet.bg, because in Cyrillic these could be confused with com.br (сот.бг, сом.бг) and net.br (иет.бг, пет.бг).
These may not appear that confusing on your screen depending on the Cyrillic font but the confusion comes from the fact that lower case italic “t” looks like “m” in most fonts, and lower case italic “p” looks like “n” in many fonts.

In fact, the conditions could simply specify that these confusing domains should be registered for free in perpetuity to registro.br and its successors. For the future, if registro.br decides to open up a new 2nd level domain, it would be up to them to avoid confusing ones, and if they choose one that is not registered in .бг then they should also get a free registration under the same terms. This free registration should be limited to no more than 10 3-character or 2-character 2nd level domains. Or maybe registro.br should get right of first refusal on any .бг registrations for 2 or 3 character 2nd level domains.

The point is that this is not sufficient reason to deny a registration, just reason to set some reasonable conditions and require that registro.br is happy with them.

Also, one point missed was that Cyrillic has a letter ь which in most fonts looks identical to English “b”. Therefore anyone who reads Cyrillic would never confuse бг with br because that first letter can only be б not ь.

And a lot of the 280 million Russian speakers in the world can read and understand a lot of Bulgarian content, so this is not just an issue about a small country of 7 million. I speak Russian as a second language and I can read an arbitrary Bulgarian language site and understand about 70% at first reading. Most languages using a Cyrillic alphabet are Slavic languages and the differences between them are more like those between Italian and Spanish, especially in the written languages. So please do not think of this as a minority language issue. Ukrainian, Belarusian, Serbian and Russian also use Cyrillic.

paypal.бг Michael Dillon  –  Aug 18, 2010 9:52 AM

Could paypal.бг really be an issue?

Not if .бг requires fully Cyrillic domain names.

??????.?? is about the closest that you can get. Compare with
paypal.br
