Finding out the language of an IDN


sevent
2nd September 2006, 05:06 PM
Given just the punycode, is it possible to work out the language or character set of a domain?

Also, it seems that both Chinese and Japanese can have the same character, but these are different IDNs. From what I have heard, if one is taken, the registry prevents you from reserving the same character in another language. Is that correct?

Thanks!

bramiozo
2nd September 2006, 05:22 PM
Use http://idntools.net/bulkpuny3.php to get the character set from the punycode.

Yes, there is a thing called variant blocking; it is described here (http://www.verisign.com/information-services/naming-services/internationalized-domain-names/idn-standards/idn-character-variants/page_002087.html).
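
For anyone who would rather check this locally than through a web tool, here is a minimal sketch in Python. It assumes the label uses the standard "xn--" prefix and relies only on the standard library's punycode codec and unicodedata module. The helper names decode_label and describe are made up for this example, and the character name is only a rough stand-in for the script label the online tool reports.

```python
import unicodedata

def decode_label(label):
    """Decode a single IDN label; 'xn--' marks a punycode-encoded label."""
    if label.lower().startswith("xn--"):
        # Strip the ACE prefix and decode the rest with the built-in codec.
        return label[4:].encode("ascii").decode("punycode")
    return label  # plain ASCII label, nothing to decode

def describe(label):
    """Pair each decoded character with its Unicode character name."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in decode_label(label)]

# Example label (assumed input); prints code points and names only.
for ch, name in describe("xn--fiqs8s"):
    print(f"U+{ord(ch):04X}", name)
```

A name starting with "CJK UNIFIED IDEOGRAPH" tells you the script family, not the language, which is exactly the limitation discussed further down the thread.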

sevent
2nd September 2006, 05:54 PM
Thanks for the info! I checked a name and it converted fine to the native-looking characters, but the script was labeled:

CJKUnifiedIdeographs

Does that seem right? Is there a way to find out from this info what language you are really talking about (e.g. Simplified Chinese)?

Drewbert
2nd September 2006, 06:05 PM
What do you think might be a way of figuring out if a text string is Chinese, Japanese or Korean, or multiples of them?

bramiozo
2nd September 2006, 06:16 PM
There are Unicode ranges which are shared by several languages (Latin, kanji, romaji, etc.). If every character of a string falls within such a shared range, it is impossible to determine the language directly. One would have to rely on character groups, character positions and so on; the statistical occurrence of certain combinations would then give the probabilities of the different languages.

It's possible, but it takes considerable knowledge of the relevant languages to pull it off.
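
As a rough illustration of the first step only, the sketch below checks which scripts appear in a string. This is a simple script-presence heuristic, not the statistical approach described above: kana suggests Japanese, Hangul suggests Korean, and a string made up purely of CJK unified ideographs stays ambiguous. The function names and return strings are invented for this example, and the script tags are inferred from Unicode character names, which is an approximation.

```python
import unicodedata

def scripts_in(text):
    """Collect coarse script tags based on Unicode character names."""
    found = set()
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith(("HIRAGANA", "KATAKANA")):
            found.add("kana")
        elif name.startswith("HANGUL"):
            found.add("hangul")
        elif name.startswith("CJK UNIFIED IDEOGRAPH"):
            found.add("han")
    return found

def guess_language(text):
    """Very rough guess; Han-only strings cannot be resolved this way."""
    scripts = scripts_in(text)
    if "kana" in scripts:
        return "Japanese"
    if "hangul" in scripts:
        return "Korean"
    if "han" in scripts:
        return "ambiguous (Han characters only)"
    return "unknown"

print(guess_language("日本の"))  # contains hiragana
print(guess_language("中国"))    # Han characters only
```

Telling apart the ambiguous cases would need the kind of character-frequency data mentioned above, which this sketch does not attempt.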