Devanagari Encoding Mystery


blastfromthepast
6th May 2006, 03:10 AM
साड़ी.com
xn--12bmg5i.com
136 google

साड़ी.com
xn--e2b9bngm.com
436 google

These domains look the same in Devanagari, but the Unicode and punycode are different, and Google finds different pages for each.

Does anyone know why?

thegenius1
6th May 2006, 03:19 AM
साड़ी.com
xn--12bmg5i.com
136 google

साड़ी.com
xn--e2b9bngm.com
436 google

These domains look the same in Devanagari, but the punycode is different, and Google finds different pages for each.

Does anyone know why?

Yes, the little dots on the bottom are spaced differently.

blastfromthepast
6th May 2006, 03:31 AM
Yes, the little dots on the bottom are spaced differently.

Can you make a screen shot? They look the same on my system.

thegenius1
6th May 2006, 03:40 AM
Can you make a screen shot? They look the same on my system.


Well, after I slapped them into Microsoft Word this became an enigma, but by looking closely at what you posted I can clearly see that the top one's dot is further from the "S" and the second one's is closer.

IDN.TV
6th May 2006, 04:00 AM
Well, after I slapped them into Microsoft Word this became an enigma, but by looking closely at what you posted I can clearly see that the top one's dot is further from the "S" and the second one's is closer.

Both mean sari, a traditional dress worn by Indian women. OK, I think I know the answer; if I am wrong, correct me.

Both are written the same, and they are for sure the same word, but one is written in Hindi and the other is probably written in Sanskrit or another language with the same script as Hindi. When Google maps it for searches it converts it into punycode and sees them only as punycode, and the punycode for the first one is different from the second one, so you get different results. They mean the same thing, so this should be rectified. It can probably happen for any Indian languages that share a script, so there might be two different punycodes. But nice error found.

blastfromthepast
6th May 2006, 04:28 AM
This matters a great deal, because it seems that some keyboards are keying in ड़ and some are keying in ड़.

Both are Devanagari and valid. Sanskrit should be using the same encoding as Hindi. The script is language independent.

If you thought the Latin phishing domain issue was overblown, get ready for same-script phishing domains.

A little bit of googling revealed that this problem appears in other Indian scripts as well, not just in Devanagari.

a2zofb2b
6th May 2006, 07:14 PM
Looks like it could be a major problem.

Here are the 2 variations and the Unicode sequence for each.

xn--e2b9bngm.com (साड़ी.com) = स ा ड ़ ी (U+0938 U+093E U+0921 U+093C U+0940)

and

xn--12bmg5i.com (साड़ी.com) = स ा ड़ ी (U+0938 U+093E U+095C U+0940)
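
For anyone who wants to check the pairing on their own machine, here is a minimal Python sketch. It lists the code points of each spelling and punycode-encodes the raw string with no normalization at all. The two escape sequences are assumed to be the spellings under discussion, and `describe` is only an illustrative helper name.

```python
import unicodedata

# Written as escapes so an editor or browser cannot silently normalize them.
# These two sequences are an assumption about what the keyboards produce.
PRECOMPOSED = "\u0938\u093e\u095c\u0940"        # dot-below letter as one code point
DECOMPOSED = "\u0938\u093e\u0921\u093c\u0940"   # base letter followed by nukta

def describe(label):
    """Return (list of code point descriptions, raw xn-- label)."""
    names = [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in label]
    # The "punycode" codec does no nameprep; it encodes exactly what it is given.
    ace = "xn--" + label.encode("punycode").decode("ascii")
    return names, ace

for text in (PRECOMPOSED, DECOMPOSED):
    names, ace = describe(text)
    print(ace)
    for name in names:
        print("   ", name)
```

Running it prints each punycode label followed by the code points that produced it, so the match between label and sequence can be read off directly.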

blastfromthepast
6th May 2006, 07:31 PM
When Google maps it for searches it converts it into punycode and sees them only as punycode, and the punycode for the first one is different from the second one, so you get different results.

Google doesn't deal with punycode. Google searches for Unicode text in UTF-8 encoding. Since there appear to be two ways to enter this text, Google could, and should, combine the results.

When Unicode is converted to punycode to create a domain name, it is supposed to be normalized so that such problems don't occur.

Looks like it could be a major problem.

Here are the 2 variations and the Unicode sequence for each.

xn--e2b9bngm.com (साड़ी.com) = स ा ड ़ ी (U+0938 U+093E U+0921 U+093C U+0940)

and

xn--12bmg5i.com (साड़ी.com) = स ा ड़ ी (U+0938 U+093E U+095C U+0940)

That is it. Thanks for the explanation.

I tried putting both into IBM's punycode converter, and I get identical results (xn--e2b9bngm). If registrars implement the punycode conversion mechanism correctly then this shouldn't be a problem. Looks like some registrars had it wrong and weren't running the Unicode through the nameprep routine, and some people are now stuck with domain lookalikes because of errors in their registrars' punycode conversion.

http://www-950.ibm.com/software/globalization/icu/demo/domain?t=test&x=22&y=17
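
The same comparison can be run locally. This is a minimal sketch, assuming Python's built-in `idna` codec (IDNA 2003 ToASCII, which runs nameprep before punycode) behaves like the IBM demo. `to_ace` and `raw_ace` are illustrative names, and the two escape sequences are assumed to be the spellings discussed in this thread.

```python
import unicodedata

PRECOMPOSED = "\u0938\u093e\u095c\u0940"        # single code point U+095C
DECOMPOSED = "\u0938\u093e\u0921\u093c\u0940"   # U+0921 followed by U+093C

def to_ace(label):
    """ToASCII through the stdlib codec: nameprep first, then punycode."""
    return label.encode("idna").decode("ascii")

def raw_ace(label):
    """What a converter that skips nameprep would hand to the registry."""
    return "xn--" + label.encode("punycode").decode("ascii")

for text in (PRECOMPOSED, DECOMPOSED):
    print(raw_ace(text), "->", to_ace(text))

# Nameprep normalizes with NFKC, which maps both spellings to one sequence.
print(unicodedata.normalize("NFKC", PRECOMPOSED) == DECOMPOSED)
```

Both spellings collapse to one label because U+095C is on Unicode's composition exclusion list, so normalization always yields the two-code-point form.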

IDN.TV
6th May 2006, 07:52 PM
A little bit of googling revealed that this problem appears in other Indian scripts as well, not just in Devanagari.


In what other languages have you found this error?

blastfromthepast
6th May 2006, 08:05 PM
In what other languages have you found this error?

There are two separate problems.

1. Google is not combining results for differently-ordered text in Indic scripts - this applies to any script where you can enter characters in a different order to produce the same character. This should be resolved by Google in the future. Maybe we should let them know.

2. Some registrars were not performing the nameprep routine, which is supposed to resolve such differences and produce a single punycode. So domains that were registered early on may be affected.
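
Both points can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `search_key` and `skipped_nameprep` are hypothetical helper names, the stdlib `idna` codec stands in for a correct registrar conversion, and nothing here reflects how Google or any registrar actually works.

```python
import unicodedata

PRECOMPOSED = "\u0938\u093e\u095c\u0940"        # assumed spelling 1
DECOMPOSED = "\u0938\u093e\u0921\u093c\u0940"   # assumed spelling 2

def search_key(text):
    """Point 1: fold canonically equivalent spellings to a single index key."""
    return unicodedata.normalize("NFC", text)

def skipped_nameprep(ace_label):
    """Point 2: True if an xn-- label differs from nameprep + punycode output.

    Assumes ace_label is a single label that starts with "xn--".
    """
    text = ace_label[4:].encode("ascii").decode("punycode")
    return text.encode("idna").decode("ascii") != ace_label.lower()

print(search_key(PRECOMPOSED) == search_key(DECOMPOSED))
for label in ("xn--12bmg5i", "xn--e2b9bngm"):
    print(label, "skipped nameprep:", skipped_nameprep(label))
```

A label that fails this round trip could only have been registered by a converter that encoded the raw input, which is the situation described in point 2.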

drbiohealth
7th May 2006, 03:38 AM
That seems to be a genuine error. Thanks Dan for bringing that up!

blastfromthepast
16th March 2007, 10:10 PM
साड़ी.com
xn--12bmg5i.com
136 google


Formerly owned by Ms. Snow, this domain has now been dropped. Evidence Snow is reading this forum!?

Rubber Duck
16th March 2007, 10:31 PM
It is getting like swimming with sharks in the dark these days.

Who knows who is on the guest list?

alpha
16th March 2007, 10:44 PM
It is getting like swimming with sharks in the dark these days.

Who knows who is on the guest list?

I will officially piss my pants if it comes out one day that the Snows walk amongst us.

yanni
17th March 2007, 01:19 AM
I wouldn't be surprised. They certainly have got enough mentions here during this past year; enough to be noticed...

.

alpha
17th March 2007, 12:14 PM
I wouldn't be surprised. They certainly have got enough mentions here during this past year; enough to be noticed...

.

I meant in the sense that they are an active member of the board under an alias.

hey, maybe it's you. :eek: