Finding out the language of an IDN
sevent
2nd September 2006, 05:06 PM
Given just the punycode, is it possible to work out the language or character set of a domain?
Also, it seems that Chinese and Japanese can share the same character, yet these count as different IDNs. From what I have heard, once one is taken, the registry prevents you from reserving the same character in the other language. Is that correct?
Thanks!
bramiozo
2nd September 2006, 05:22 PM
Use http://idntools.net/bulkpuny3.php to get the character set from punycode.
Yes, there is a thing called variant blocking. It is described here (http://www.verisign.com/information-services/naming-services/internationalized-domain-names/idn-standards/idn-character-variants/page_002087.html).
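For anyone who wants to do the same check offline, here is a minimal Python sketch. It is not how the idntools.net page works (I don't know its internals). The function names `decode_idn` and `blocks_of` are made up for illustration, and the block table only covers a handful of ranges, so treat it as a starting point under those assumptions.

```python
# Sketch: decode punycode labels and report which Unicode block each character is in.
# decode_idn / blocks_of are hypothetical helper names; the table is deliberately tiny.

BLOCKS = [
    (0x0000, 0x007F, "BasicLatin"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJKUnifiedIdeographs"),
    (0xAC00, 0xD7AF, "HangulSyllables"),
]

def decode_idn(domain):
    """Decode every 'xn--' label of a domain with Python's built-in punycode codec."""
    labels = []
    for label in domain.split("."):
        if label.lower().startswith("xn--"):
            label = label[4:].encode("ascii").decode("punycode")
        labels.append(label)
    return ".".join(labels)

def blocks_of(text):
    """Return the set of block names seen in text ('Other' if not in the table)."""
    found = set()
    for ch in text:
        cp = ord(ch)
        for lo, hi, name in BLOCKS:
            if lo <= cp <= hi:
                found.add(name)
                break
        else:
            found.add("Other")
    return found

if __name__ == "__main__":
    name = decode_idn("xn--fiqs8s")
    print(name, blocks_of(name))
```

Note that this only tells you the block (the script), not the language, which is the same thing the online tool reports.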
sevent
2nd September 2006, 05:54 PM
Thanks for the info! I checked a name and it converted fine to the native-looking characters, but the script was labeled:
CJKUnifiedIdeographs
Does that seem right? Is there a way to find out from this info what language you are really talking about (e.g. Simplified Chinese)?
Drewbert
2nd September 2006, 06:05 PM
What do you think might be a way of figuring out if a text string is Chinese, Japanese or Korean, or multiples of them?
bramiozo
2nd September 2006, 06:16 PM
Some Unicode ranges are shared by several languages (Latin, kanji, romaji, etc.). If every character of a string falls within such a shared range, it is impossible to determine the language directly. One would have to rely on character groups, character positions and so on; the statistical occurrence of particular combinations would then give the probabilities of the different languages.
It's possible, but it takes quite some effort studying the relevant languages if you want to pull it off.
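As a rough illustration of the easy part of this, here is a small Python sketch that only looks at which scripts are present. It is not the statistical approach described above, just a first-pass filter: kana hints at Japanese, Hangul hints at Korean, and a string made only of Han characters stays ambiguous. The function name `cjk_language_hints` and the returned labels are my own invention.

```python
# Sketch: script-based hints for CJK strings. Not a real language detector.
# cjk_language_hints is a hypothetical helper; ranges are the main Unicode blocks only.

def cjk_language_hints(text):
    """Return a sorted list of candidate languages based on the scripts present."""
    has_han = has_kana = has_hangul = False
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:      # Hiragana and Katakana
            has_kana = True
        elif 0xAC00 <= cp <= 0xD7AF:    # Hangul syllables
            has_hangul = True
        elif 0x4E00 <= cp <= 0x9FFF:    # CJK Unified Ideographs
            has_han = True

    hints = set()
    if has_kana:
        hints.add("Japanese")
    if has_hangul:
        hints.add("Korean")
    if has_han and not hints:
        # Han characters alone are shared, so every candidate stays open.
        hints.update({"Chinese", "Japanese", "Korean"})
    return sorted(hints)

if __name__ == "__main__":
    for sample in ["東京タワー", "한국어", "中国"]:
        print(sample, cjk_language_hints(sample))
```

For a Han-only name like the one in question, this returns all three candidates, which is exactly the case where you would need the statistical work on character combinations.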