View Full Version : IDN-characters with php


bramiozo
30th October 2005, 10:13 PM
I have a problem with IDN-characters in php.
I can enter utf-8 characters but they are not correctly translated when it is submitted to the whois-search.

if($ext=="com" || $ext=="net"||$ext=="org" || $ext=="COM" || $ext=="NET"||$ext=="ORG")
if(stristr($domein, "xn--")){$NICserver="whois.internic.net";$zoek="$domein.$ext";}
elseif(preg_match('/^[a-z0-9]+$/i',$domein)){$NICserver="whois.networksolutions.com";$zoek="$domein.$ext";}
else
{$NICserver="whois.internic.net";$zoek="$domein.$ext"; echo "$domein" ;}

Does anyone have any tips regarding IDN and PHP ?

Rubber Duck
30th October 2005, 10:31 PM
This is a bit over my head, but I do know that you can only Whois the punycode.

Best Regards
Dave Wrixon
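
A minimal sketch of that point, assuming the label arrives as a UTF-8 string: before querying whois, decide whether the label is already punycode, plain ASCII, or still needs encoding. The function name `classify_label` is hypothetical, and allowing hyphens in the ASCII pattern is an assumption (the check in the question only allows letters and digits).

```php
<?php
// Hypothetical helper: decide what has to happen to a label before a whois query.
function classify_label(string $label): string
{
    // Check the ACE prefix first, because "xn--..." also matches the ASCII pattern.
    if (stripos($label, 'xn--') === 0) {
        return 'punycode';        // already encoded, can be queried as-is
    }
    if (preg_match('/^[a-z0-9-]+$/i', $label) === 1) {
        return 'ascii';           // plain name, can be queried as-is
    }
    return 'needs-encoding';      // anything else must be converted to punycode first
}
```

Only the `needs-encoding` case requires a converter; the other two can go to the whois server unchanged.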

bramiozo
31st October 2005, 10:30 AM
That's unfortunate. Are there any complete, packaged punycode converters out there? I mean, the converters must be tied to a database of some sort; do all the online converters link back to the same database, or do they each have their own?

Thanks ;) .

Rubber Duck
31st October 2005, 01:09 PM
As far as I am aware, and to some extent I am guessing, punycode is simply Unicode put through an encoding algorithm, and this is contained within the browser or the browser plugin.

Each Unicode character is identified by a number, which is generated by the keystroke.

I think the source code for this is fairly short and widely available on the internet.

Best Regards
Dave Wrixon
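
To illustrate that no database is involved, here is a compact sketch of the encoding step described in RFC 3492 (Punycode). It assumes the input is already an array of Unicode code points for a single label (no dots) and requires PHP 7 for `intdiv()`. The function names are hypothetical, and this is an illustration rather than a replacement for a tested library.

```php
<?php
// Bias adaptation from RFC 3492 (base 36, tmin 1, tmax 26, skew 38, damp 700).
function puny_adapt(int $delta, int $numPoints, bool $first): int
{
    $delta = $first ? intdiv($delta, 700) : intdiv($delta, 2);
    $delta += intdiv($delta, $numPoints);
    $k = 0;
    while ($delta > 455) {            // ((base - tmin) * tmax) / 2
        $delta = intdiv($delta, 35);  // base - tmin
        $k += 36;
    }
    return $k + intdiv(36 * $delta, $delta + 38);
}

// Digits 0..25 map to a..z, digits 26..35 map to 0..9.
function puny_digit(int $d): string
{
    return chr($d < 26 ? $d + 97 : $d + 22);
}

function punycode_encode_label(array $codepoints): string
{
    $n = 128; $delta = 0; $bias = 72; $out = '';
    foreach ($codepoints as $c) {
        if ($c < 128) { $out .= chr($c); }   // copy the basic (ASCII) characters
    }
    $basic = strlen($out);
    $total = count($codepoints);
    if ($basic === $total) {
        return $out;                          // pure ASCII: nothing to encode
    }
    if ($basic > 0) { $out .= '-'; }
    $h = $basic;
    while ($h < $total) {
        $m = PHP_INT_MAX;                     // smallest code point not handled yet
        foreach ($codepoints as $c) {
            if ($c >= $n && $c < $m) { $m = $c; }
        }
        $delta += ($m - $n) * ($h + 1);
        $n = $m;
        foreach ($codepoints as $c) {
            if ($c < $n) {
                $delta++;
            } elseif ($c === $n) {
                $q = $delta;                  // write delta as a variable-length integer
                for ($k = 36; ; $k += 36) {
                    $t = $k <= $bias ? 1 : ($k >= $bias + 26 ? 26 : $k - $bias);
                    if ($q < $t) { break; }
                    $out .= puny_digit($t + ($q - $t) % (36 - $t));
                    $q = intdiv($q - $t, 36 - $t);
                }
                $out .= puny_digit($q);
                $bias = puny_adapt($delta, $h + 1, $h === $basic);
                $delta = 0;
                $h++;
            }
        }
        $delta++;
        $n++;
    }
    return 'xn--' . $out;
}
```

For example, `punycode_encode_label([0xE9])` (the single character é) yields `xn--9ca`.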

bramiozo
31st October 2005, 03:12 PM
Found it here :
http://phlymail.de/en/downloads/idna/download/

bramiozo
31st October 2005, 03:56 PM
Hmmm, only iso-8859-1 is supported (utf8_encode); the Chinese characters and iso-8859-2..8859-16 need to be converted separately.

Rubber Duck
31st October 2005, 04:10 PM
<?php

/*

Copyright (C) 2004 Owen Borseth - owen@name.com

See additional usage and copyright information at http://www.bluerider.com/idn/copyright.php.

*/

$prefix = "xn--";
$delim = "-";
$base = 36;
$tmin = 1;
$tmax = 26;
$skew = 38;
$damp = 700;
$initial_bias = 72;
$initial_n = 128;

function unicode_hexncr($text)
{
global $initial_n;

$text = utf8_to_unicode($text);
$return_array = array();

foreach($text as $codepoint)
{
if($codepoint >= $initial_n)
array_push($return_array, "&#x".dechex($codepoint).";");
else
array_push($return_array, chr($codepoint));
}

return($return_array);
}

function decode($text)
{
global $base, $tmin, $tmax, $skew, $damp, $initial_bias, $initial_n, $prefix, $delim;

$n = $initial_n;
$i = 0;
$bias = $initial_bias;
$output = array();

if(substr($text, 0, strlen($prefix)) != $prefix)
return($text);
else
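// Strip only the leading prefix; str_replace() would also remove any later "xn--" inside the label.
$text = substr($text, strlen($prefix));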

$delim_pos = strrpos($text, $delim);

if($delim_pos !== false)
{
for($j = 0; $j < $delim_pos; $j++)
array_push($output, $text[$j]);
$text = substr($text, $delim_pos + 1);
}

for(; strlen($text) > 0;)
{
$oldi = $i;
$w = 1;

for($k = $base;1; $k = $k + $base)
{
$digit = decode_digit($text[0]);
$text = substr($text, 1);
$i = $i + $digit * $w;

$t = 0;
if($k <= $bias + $tmin)
$t = $tmin;
elseif($k >= $bias + $tmax)
$t = $tmax;
else
$t = $k - $bias;

if($digit < $t)
break;

$w = $w * ($base - $t);
}

$bias = adapt($i - $oldi, sizeof($output) + 1, $oldi == 0);
$n = $n + floor($i / (sizeof($output) + 1));
$i = $i % (sizeof($output) + 1);

$tmp = $output;
$output = array();

$j = 0;
for($j = 0; $j < $i; $j++)
array_push($output, $tmp[$j]);
array_push($output, unicode_to_utf8($n));
for($j = $j; $j < sizeof($tmp); $j++)
array_push($output, $tmp[$j]);

$i++;
}

return(implode($output));
}

function encode($text)
{
global $base, $tmin, $tmax, $skew, $damp, $initial_bias, $initial_n, $prefix, $delim;

$text = utf8_to_unicode($text);

$codecount = 0;
$basic_string = "";
$extended_string = "";

for ($i = 0; $i < sizeof($text); $i++)
{
if($text[$i] < $initial_n)
{
$basic_string .= chr($text[$i]);
$codecount++;
}
}

$n = $initial_n;
$delta = 0;
$bias = $initial_bias;
$h = $codecount;

while($h < sizeof($text))
{
$m = 100000;
for($j = 0; $j < sizeof($text); $j++)
{
if($text[$j] >= $n && $text[$j] <= $m)
{
$m = $text[$j];
}
}

$delta = $delta + ($m - $n) * ($h + 1);
$n = $m;

for($j = 0; $j < sizeof($text); $j++)
{
$c = $text[$j];

if($c < $n)
$delta++;
elseif($c == $n)
{
$q = $delta;
for($k = $base;1;$k = $k + $base)
{
$t = 0;
if($k <= $bias + $tmin)
$t = $tmin;
elseif($k >= $bias + $tmax)
$t = $tmax;
else
$t = $k - $bias;

if($q < $t)
break;

$extended_string .= encode_digit($t + (($q - $t) % ($base - $t)));
$q = floor(($q - $t) / ($base - $t));
}
$extended_string .= encode_digit($q);

$bias = adapt($delta, $h+1, $h==$codecount);
$delta = 0;
$h++;
}
}
$delta++;
$n++;
}

if(strlen($basic_string) > 0 && strlen($extended_string) < 1)
{
$encoded = $basic_string;
}
elseif(strlen($basic_string) > 0 && strlen($extended_string) > 0)
{
$encoded = $prefix.$basic_string.$delim.$extended_string;
}
elseif(strlen($basic_string) < 1 && strlen($extended_string) > 0)
{
$encoded = $prefix.$extended_string;
}

return($encoded);
}

function adapt($delta, $numpoints, $firsttime)
{
global $base, $tmin, $tmax, $skew, $damp;

if($firsttime)
$delta = floor($delta / $damp);
else
$delta = floor($delta / 2);

$delta = $delta + floor($delta / $numpoints);

$k = 0;
while($delta > floor((($base - $tmin) * $tmax) / 2))
{
$delta = floor($delta / ($base - $tmin));
$k = $k + $base;
}

return($k + (floor((($base - $tmin + 1) * $delta) / ($delta + $skew))));
}

/*

Function encode_digit and decode_digit were adapted from punycode.c, part of GNU Libidn.

http://www.gnu.org/software/libidn/doxygen/punycode_8c-source.html

*/
function encode_digit($d)
{
return chr(($d + 22 + 75 * ($d < 26)));
}

function decode_digit($cp)
{
global $base;

$cp = ord($cp);
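// Explicit range checks: the C original relies on unsigned arithmetic, which PHP integers do not have.
return ($cp >= 48 && $cp < 58) ? $cp - 22 : (($cp >= 65 && $cp < 91) ? $cp - 65 : (($cp >= 97 && $cp < 123) ? $cp - 97 : $base));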
}

/*

Copyright (C) 2002 Scott Reynen

Function utf8_to_unicode and unicode_to_utf8 was taken from an article titled "How to develop multilingual, Unicode
applications with PHP" at the following URL:

http://www.randomchaos.com/document.php?source=php_and_unicode

*/
function unicode_to_utf8( $unicode )
{
$utf8 = '';

if ( $unicode < 128 )
{
$utf8.= chr( $unicode );
}
elseif ( $unicode < 2048 )
{
$utf8.= chr( 192 + ( ( $unicode - ( $unicode % 64 ) ) / 64 ) );
$utf8.= chr( 128 + ( $unicode % 64 ) );
}
else
{
$utf8.= chr( 224 + ( ( $unicode - ( $unicode % 4096 ) ) / 4096 ) );
$utf8.= chr( 128 + ( ( ( $unicode % 4096 ) - ( $unicode % 64 ) ) / 64 ) );
$utf8.= chr( 128 + ( $unicode % 64 ) );
}

return $utf8;
}

function utf8_to_unicode( $str )
{

$unicode = array();
$values = array();
$lookingFor = 1;

for ($i = 0; $i < strlen( $str ); $i++ )
{

$thisValue = ord( $str[ $i ] );

if ( $thisValue < 128 )
$unicode[] = $thisValue;
else
{

if ( count( $values ) == 0 )
$lookingFor = ( $thisValue < 224 ) ? 2 : 3;

$values[] = $thisValue;

if ( count( $values ) == $lookingFor )
{
$number = ( $lookingFor == 3 ) ?
( ( $values[0] % 16 ) * 4096 ) + ( ( $values[1] % 64 ) * 64 ) + ( $values[2] % 64 ):
( ( $values[0] % 32 ) * 64 ) + ( $values[1] % 64 );

$unicode[] = $number;
$values = array();
$lookingFor = 1;
}
}
}
return $unicode;
}

?>

gammascalper
31st October 2005, 05:51 PM
Good find guys... I'll try to put up a bulk checker soon.

bramiozo
31st October 2005, 08:35 PM
http://beginnerguides.com.server6.firstfind.nl/tblox/domeincheck.php works with utf-8

It doesn't accept all characters yet.

I'll try iconv ( http://www.phpfreaks.com/phpmanual/page/function.iconv.html ) for the other characters (standard in php 5) .
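
A minimal sketch of the iconv idea, assuming the source charset of the submitted bytes is already known. The wrapper name `to_utf8` is hypothetical; `iconv()` itself is the standard PHP function.

```php
<?php
// Convert bytes from a known legacy charset to UTF-8.
// Returns null when the iconv extension is missing or the conversion fails.
function to_utf8(string $bytes, string $fromCharset): ?string
{
    if (!function_exists('iconv')) {
        return null;
    }
    // @ suppresses the notice/warning iconv() raises on bad input or unknown charsets.
    $converted = @iconv($fromCharset, 'UTF-8', $bytes);
    return $converted === false ? null : $converted;
}
```

The result can then be passed to the punycode encoder, which expects UTF-8 input.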


I have tried to implement the code you found Dw but it continuously says "Call to an undefined function: " .

some tools :
http://beginnerguides.com.server6.firstfind.nl/tblox/whois-idn-tools.rar

bramiozo
31st October 2005, 09:32 PM
It would of course be mighty fine if we could identify the charset.

determine_charset() doesn't seem to work and I can't find anything suitable on php.net.
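
Reliable charset detection is not possible in general, but a rough heuristic is to test whether the bytes are well-formed UTF-8 and otherwise fall back to an assumed legacy charset. The sketch below uses only `preg_match()` with the `/u` modifier; the function names and the ISO-8859-1 fallback are assumptions. Where the mbstring extension is available, `mb_detect_encoding()` is another option.

```php
<?php
function is_valid_utf8(string $bytes): bool
{
    // With /u, PCRE refuses subjects that are not well-formed UTF-8,
    // so preg_match() returns false instead of 1.
    return preg_match('//u', $bytes) === 1;
}

// Guess the charset: UTF-8 if the bytes validate, otherwise the given fallback.
function guess_charset(string $bytes, string $fallback = 'ISO-8859-1'): string
{
    return is_valid_utf8($bytes) ? 'UTF-8' : $fallback;
}
```

Note that pure ASCII input is also valid UTF-8, so it is reported as UTF-8.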

bramiozo
1st November 2005, 10:55 AM
Almost there, the correct punycode is produced for all characters, but the characters are not displayed correctly yet.
header("content-type: text/html; charset=UTF-8"); doesn't work
and the meta-equivalent doesn't work either, the encoding for the page is still set to iso-8859-1, does anyone have a solution for this ?

gammascalper
1st November 2005, 11:32 AM
Nice bramiozo -- works well except for this issue which Sedo also has problems with. Try putting this in your .htaccess:

CharsetSourceEnc utf-8

bramiozo
1st November 2005, 11:39 AM
Haha I just found the .htaccess solution on the net, thanks anyway ;D .

Got it, it works fine now !

I'll translate it to english and then I will put up the code.

bramiozo
1st November 2005, 12:15 PM
Removed ......

bramiozo
1st November 2005, 12:27 PM
Ah shit, the same problem again: when I changed the .htaccess it treated áll files as text files (it simply opened the 1.6 MB .rar files as if they were .txt), so I changed the lines in .htaccess to
AddCharset UTF-8 .html
AddCharset UTF-8 .php

The .rar files can now be downloaded, but the site is again set to iso-8859-1.

bramiozo
1st November 2005, 12:31 PM
OK finished (finally)

This did the trick :
AddDefaultCharset UTF-8
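
On the PHP side, `header()` only has an effect if it is called before any output has been sent, which is one common reason a charset header can appear not to work; the thread does not say whether that was the cause here. A small sketch with a hypothetical wrapper name:

```php
<?php
// Send the UTF-8 content type if it is still possible to send headers.
function send_utf8_header(): bool
{
    if (headers_sent()) {
        return false;   // too late: output has already started
    }
    header('Content-Type: text/html; charset=UTF-8');
    return true;
}
```

Calling this at the very top of the script, before any HTML or whitespace, gives the header the best chance of being sent.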

Rubber Duck
1st November 2005, 04:30 PM
Don't know whether this helps at all?

Dave Wrixon

<?php

require("header.php");

?>

<table width="775">
<tr>
<td>

Also try <a href='index.php'>SINGLE</a> and <a href='bulk.php'>BULK</a> conversions.
<br>

<?php

require("common_body.php");
require("punycode.php");

?>

<form action="index.php" method="post">
<b>Input</b><br>
<input type='text' name='text'><br><br>
<input type='radio' name='type' value='native' checked> Unicode -> Punycode<br>
<input type='radio' name='type' value='punycode'> Punycode -> Unicode<br><br>
<input type='submit' name='submit' value='submit'>
</form>
<br><br>

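<?php

// Read the submitted values explicitly instead of relying on register_globals.
$text = isset($_POST['text']) ? trim(strtolower($_POST['text'])) : '';
$type = isset($_POST['type']) ? $_POST['type'] : '';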
$tld = strrchr($text, ".");

if($tld)
$text = str_replace($tld, "", $text);

if($text && $type == "native")
{
echo("<table border='1' cellpadding='0' bgcolor='#e2e2e2' width='100%'>");
echo("<tr><td colspan='2' bgcolor='#f9ff68'><b>Primary Info</b></td></tr>");
echo("<tr><td>Unicode Text:&nbsp;&nbsp;&nbsp;</td><td>".str_replace(" ", "&nbsp;", "$text$tld")."</td></tr>");
$hex_array = unicode_hexncr($text);
$text = encode($text);
echo("<tr><td>Punycode Text:&nbsp;&nbsp;&nbsp;</td><td><a href='http://www.domainsite.com/shopping_cart.php?domain=$text&amp;opttldarray[]=com&amp;opttldarray[]=net&amp;opttldarray[]=org&amp;opttldarray[]=biz' target='_blank'>".str_replace(" ", "&nbsp;", "$text$tld")."</a></td></tr>");
echo("</table><br>");

echo("<table border='0' cellpadding='0' bgcolor='#e2e2e2' width='100%'>");
echo("<tr><td colspan='2' bgcolor='#f9ff68'><b>Additional Info</b></td></tr>");
echo("<tr><td>Hex NCR's:&nbsp;&nbsp;&nbsp;</td><td><textarea cols='40'>");

foreach($hex_array as $hex)
{
echo(htmlentities("$hex"));
}

echo("$tld</textarea></td></tr>");
echo("</table><br>");

echo("<font size='-1'><b>hint</b> - you can display the Unicode on a webpage by pasting the hex NCR's into your HTML</font>");
echo("<br>");
}



elseif($text && $type == "punycode")
{
echo("<table border='0' cellpadding='0' bgcolor='#e2e2e2' width='100%'>");
echo("<tr><td colspan='2' bgcolor='#f9ff68'><b>Primary Info</b></td></tr>");
echo("<tr><td>Punycode Text:&nbsp;&nbsp;&nbsp;</td><td><a href='http://www.domainsite.com/shopping_cart.php?domain=$text&amp;opttldarray[]=com&amp;opttldarray[]=net&amp;opttldarray[]=org&amp;opttldarray[]=biz' target='_blank'>".str_replace(" ", "&nbsp;", "$text$tld")."</a></td></tr>");
$text = decode($text);
$hex_array = unicode_hexncr($text);
echo("<tr><td>Unicode Text:&nbsp;&nbsp;&nbsp;</td><td>".str_replace(" ", "&nbsp;", "$text$tld")."</td></tr>");
echo("</table><br>");

echo("<table border='0' cellpadding='0' bgcolor='#e2e2e2' width='100%'>");
echo("<tr><td colspan='2' bgcolor='#f9ff68'><b>Additional Info</b></td></tr>");
echo("<tr><td>Hex NCR's:&nbsp;&nbsp;&nbsp;</td><td><textarea cols='40'>");

foreach($hex_array as $hex)
{
echo(htmlentities("$hex"));
}

echo("$tld</textarea></td></tr>");
echo("</table><br>");

echo("<font size='-1'><b>hint</b> - you can display the Unicode on a webpage by pasting the hex NCR's into your HTML</font>");
echo("<br>");
}

echo("<font size='-1'><b>hint</b> - translate words at <a href='http://babelfish.altavista.com/'
target='_blank'>AltaVista</a> and then encode them here</font><br><br>");

?>

</td>

<td valign='top'>
<a href='http://www.domainsite.com' target='_blank'><img src='dsiteidn.png' border='0' alt='Register IDN domains at
www.domainsite.com'></a>
</td>
</tr>
</table>

<?php

require("footer.php");

?>

bramiozo
1st November 2005, 05:54 PM
Thanks for the aid Dw but the matter is pretty much resolved and I want to keep matters in my own hand so linking to an external source is not really an option (also with regard to the graphical output).

http://beginnerguides.com.server6.firstfind.nl/tblox/idn-comp-whois.rar

I am working on a script that you will all find very interesting, you'll see when I give you the link, have patience though :) .

Rubber Duck
1st November 2005, 06:13 PM
No worries. I have much more luck with Asian Languages than this gobbledegook!

Best Regards
Dave Wrixon

bramiozo
10th November 2005, 12:15 PM
http://beginnerguides.com.server6.firstfind.nl/tblox/bulkpuny2.php

That puny-converter works, the bulk whois doesn't work just yet.

Rubber Duck
10th November 2005, 12:51 PM
http://beginnerguides.com.server6.firstfind.nl/tblox/bulkpuny.php

That puny-converter works, the bulk whois doesn't work just yet.



Promising start, keep us posted. Perhaps you can then get Olney to include a link from this site.

Best Regards
Dave Wrixon

bramiozo
15th November 2005, 02:24 PM
<?PHP
//include('idna_convert.class.php');
class domeincheck
{
/*
This function is best used with a boolean
*/

function domeincheck($domain)
{
if(!isset($domain) && $domain!="")
{
if(stristr($domain,"http://") || stristr($domain,"www."))
{
if(stristr($domain,"http://"))
{
$domain=str_replace("http://","",$domain);
}

if(stristr($domain,"www."))
{
$domain=str_replace("www.","",$domain);
}
}
while ($i < mb_strlen($domain))
{ $punt = mb_substr($domain,$i,1);
if ($punt == ".")
{
$ext = mb_substr($domain,$i+1,mb_strlen($domain)-$s);
$domain = mb_substr($domain,0,$i);
$punt="";

if(!preg_match('/^[a-z0-9]+$/i',$domain))
{
$IDNcheck=TRUE;
$IDN = new idna_convert();
$idomain = $IDN->encode($domain);
$search="$idomain.$ext";
}


if($ext=="com" || $ext=="net"||$ext=="org" || $ext=="COM" || $ext=="NET"||$ext=="ORG")
{
$NICserver="whois.internic.net";

if(stristr($domain, "xn--") && $IDNcheck=false){$search="$domain.$ext";}
elseif(preg_match('/^[a-z0-9]+$/i',$domain) && $IDNcheck=false) {$search="$domain.$ext";}
}
}
elseif($ext=="be" || $ext=="BE"){$NICserver="whois.dns.be";$search="$domain.$ext";}
elseif($ext=="to" || $ext=="TO"){$NICserver="whois.tonic.to";$search="$domain.$ext";}
elseif($ext=="info" || $ext=="INFO"){$NICserver="whois.opensrs.net";$search="$domain.$ext";}
//If anyone knows more NIC servers, add them here!
elseif($ext=="bz" || $ext=="BZ"){$NICserver="mhpwhois1.verisign-grs.net";$search="$domain.$ext";}
elseif($ext=="info" || $ext=="INFO"){$NICserver="whois.opensrs.net";$search="$domain.$ext";}
else {$NICserver="whois.nic.$ext";$search="$domain.$ext";}
{
$i++;
}
}
$socket = fsockopen("$NICserver", 43);
if(!$socket)
{ echo "<font face='Arial' size='1'>failed to retreive $domain.$ext , <B>$NICserver</B> probably doesn't exist.<BR>\n A possible cause is the inexistence of the extension <B>$ext</B> .<BR>\n Or the whois server is temporarily unavailable or overloaded.<BR>\n </font>";
}
else
{
fputs($socket,"$search \n");
//echo "$search";
while(!feof($socket))
{
$output .= fgetss($socket,128);
}
fclose($socket);
if(strstr($output,"No match"))
{
$available="is available";
return true;//return echo "<h2>$domain.$ext $available</h2>";

}
elseif(isset($output))
{
$available="is not available";
return false; //return echo "<h2>$domain.$ext $available</h2>";
}
}
}
return $available;
}
}
?>


I want this to return a true/false but it doesn't work at all, it gives "1" for all domains.
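
One way to get a plain true/false is to split the work into small functions that each return a value, and keep the socket handling separate. The sketch below covers only the two pure steps; the function names are hypothetical, the server list is the one from the class above, and treating "No match" as "available" follows the posted code (other registries may word it differently).

```php
<?php
// Pick the whois server for an extension (case-insensitive).
function whois_server_for(string $ext): string
{
    $ext = strtolower($ext);
    $servers = [
        'com'  => 'whois.internic.net',
        'net'  => 'whois.internic.net',
        'org'  => 'whois.internic.net',
        'be'   => 'whois.dns.be',
        'to'   => 'whois.tonic.to',
        'info' => 'whois.opensrs.net',
        'bz'   => 'mhpwhois1.verisign-grs.net',
    ];
    // Fall back to the whois.nic.<ext> convention used in the posted code.
    return $servers[$ext] ?? "whois.nic.$ext";
}

// Interpret the raw whois response as a boolean.
function whois_says_available(string $response): bool
{
    return strpos($response, 'No match') !== false;
}
```

Lowercasing the extension once removes the need to compare against both "com" and "COM".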

gammascalper
15th November 2005, 03:16 PM
I scanned the code, but haven't tested it.

The script returns $available which is always going to be true.

bramiozo
15th November 2005, 04:55 PM
........
........
if(strstr($output,"No match"))
{
return true;
}
elseif(isset($output))
{
return false;
}
}
}

}
}
?>


Doesn't work

bramiozo
15th November 2005, 08:26 PM
Now it's just a function, but it's practically the same.


fclose($socket);
if(strstr($output,"No match"))
{
$check=true;
}
elseif(isset($output))
{
$check=false;
}
}
}
if($check=true){print "available";}else{print "not available";}
}
?>


Doesn't work either :(

gammascalper
15th November 2005, 08:31 PM
if($check=true){print "available";}else{print "not available";}

Comparison operator should be ==
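
A small sketch of the difference; the function names are made up for the illustration:

```php
<?php
function label_buggy(bool $check): string
{
    // Single "=" assigns true to $check, so this branch is always taken.
    if ($check = true) {
        return 'available';
    }
    return 'not available';
}

function label_fixed(bool $check): string
{
    // "===" compares without changing $check.
    if ($check === true) {
        return 'available';
    }
    return 'not available';
}
```

With `false` as input, `label_buggy()` still reports "available", while `label_fixed()` reports "not available".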

bramiozo
23rd November 2005, 07:30 PM
Okidoki, it's working, all the time the class didn't work because I didn't allow it to work :
I had if(!isset($dnames) && $dnames!="") which of course doesn't allow anything.

http://beginnerguides.com.server6.firstfind.nl/tblox/bulkwhois2.php

and you were right about the comparison operator :) .
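
A sketch of why that guard never passes, next to one that does what was presumably intended (accept any non-empty value); both function names are hypothetical:

```php
<?php
// The original condition: true only if the value is unset AND not empty,
// which cannot both hold, so it is false for every input.
function broken_guard($value): bool
{
    return !isset($value) && $value != "";
}

// Intended check (an assumption): the value exists and is not an empty string.
function has_input($value): bool
{
    return isset($value) && $value !== '';
}
```

Because `broken_guard()` is false for every input, the body of the `if` was simply never executed.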

gammascalper
23rd November 2005, 09:02 PM
Nice job bramiozo!

Those little bugs are hard to catch.

bramiozo
27th November 2005, 05:05 PM
http://www.idntools.net, not yet finished but useful nonetheless

bramiozo
29th November 2005, 05:06 PM
It's working better now; there were some problems in IE apparently.
The number/letter tool needs to have constraints, i.e. searching for all available nnnn.com's with double vowels or things like that...
I could print the results to a file and in the db, the file is useful for obvious reasons and the db could be used to prevent searching for names that are already registered.

so much to do, so little time

bramiozo
2nd December 2005, 06:24 PM
I expected at least the people on this board to join my little exercise :)

gammascalper
2nd December 2005, 09:54 PM
Very nice job bramiozo -- I will try to contribute more in future... like you, I'm swamped!

Here's an idea of a feature that will make the bulk-checker more useful:

- auto-detect the IDN language and use a proxy to return translations from babelfish or google.com/translate_t

bramiozo
2nd December 2005, 11:24 PM
I need to upgrade my skills :o .

I also mean members of idntools.net, currently only Dave and IDNer are members :-[ .

If people add new names (or extensions) and update existing names it would take a load off my back.
Once the database is filled completely I can maybe do a weekly automated whois-search of the entire db.

bramiozo
2nd December 2005, 11:29 PM
I think the languages can be easily identified by looking at the character tables but I've no idea how to extract a translation from babelfish ???

gammascalper
2nd December 2005, 11:37 PM
I joined just now :)

bramiozo
3rd December 2005, 11:24 AM
Welcome ;)

bramiozo
9th December 2005, 07:39 PM
bulkregister works :) .

Olney
10th December 2005, 09:26 AM
Didn't really realize what this project was.
I'll register to submit feedback in a day or so.

Looks great.

bramiozo
17th January 2006, 05:56 PM
added some languages :

Gurmukhi (atomic)
Gujarati (atomic)
Oriya (atomic, non-xid, decomposable)
Tamil (atomic, non-xid, decomposable)
Telugu (atomic, decomposable)
Kannada (atomic, decomposable)
Malayalam (atomic, decomposable)
Thai (atomic, non-xid)
Lao (atomic)
Tibetan (atomic, non-xid)
Georgian (atomic)
Hangul (atomic, decomposable)
Bopomofo (atomic)
han (atomic)

+ I am almost done with the autocheck, I will try to add functions with regard to overture and things like that....but first, my studies ...

gammascalper
17th January 2006, 06:46 PM
Great tool bramiozo. I use it, oh, at least once a week ;-)

It would be really nice if you could sort the unicode output by language.

bramiozo
17th January 2006, 08:37 PM
Hmm, what do you mean exactly, order by ABC ?

You mentioned this :
- auto-detect the IDN language and use a proxy to return translations from babelfish or google.com/translate_t

What about

name.com --> availability --> translation --> search eng. results --> overture if possible .

Right now all these things are done separately...

Also, the puny-converter doesn't convert the korean characters correctly for some reason ???

gammascalper
17th January 2006, 11:49 PM
For the bulk converter, it would be nice to group the output by language for easy cut & paste translation. For example, output all Arabic IDN, then Chinese, Japanese etc.

At present, output is in the same order as punycode input.

I think the proxy translation may be difficult to implement, but that's something people may be willing to subscribe to.

Looks like it's shaping up!

bramiozo
19th January 2006, 08:39 AM
I have been horsing around a bit and found that most languages can not be detected, I'll probably have to convert them to unicode, I'll add an extra column for that in the db. utf8 support still sucks ass amazingly..

bramiozo
22nd January 2006, 11:14 AM
It's working to the extent that all the scripts in the db are recognized.

example output:
script  original  converted    availability
latin   é.com     xn--9ca.com  taken
latin   þ.com     xn--vda.com  taken
latin   ä.com     xn--4ca.com  taken

It cannot discriminate between languages because of character overlap and mixed words. I have entered the unicode block ranges in the db, with that it can detect all scripts. The puny converter still has to be made compatible with korean chars.
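
The block-range lookup can be sketched as follows, assuming the code points have already been extracted from the UTF-8 input. The table is a small hand-picked subset of Unicode blocks kept in an array rather than in the db, and the function names and script labels are hypothetical.

```php
<?php
// Map a code point to a script name using a few Unicode block ranges.
function script_for_codepoint(int $cp): string
{
    $blocks = [
        [0x0000, 0x024F, 'latin'],     // Basic Latin through Latin Extended-B
        [0x0370, 0x03FF, 'greek'],     // Greek and Coptic
        [0x0400, 0x04FF, 'cyrillic'],
        [0x0600, 0x06FF, 'arabic'],
        [0x0E00, 0x0E7F, 'thai'],
        [0x3040, 0x309F, 'hiragana'],
        [0x30A0, 0x30FF, 'katakana'],
        [0x4E00, 0x9FFF, 'han'],       // CJK Unified Ideographs
        [0xAC00, 0xD7AF, 'hangul'],    // Hangul Syllables
    ];
    foreach ($blocks as [$start, $end, $name]) {
        if ($cp >= $start && $cp <= $end) {
            return $name;
        }
    }
    return 'unknown';
}

// All distinct scripts in a label; more than one entry means a mixed word.
function scripts_in(array $codepoints): array
{
    return array_values(array_unique(array_map('script_for_codepoint', $codepoints)));
}
```

This identifies the script only; as noted above, it says nothing about which language uses it.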

future, hopefully :
script  original  converted    translation  SE     overture  availability
latin   é.com     xn--9ca.com  fffr         23423  10000     FREE
latin   þ.com     xn--vda.com  rrt          23423  10000     FREE
latin   ä.com     xn--4ca.com  trtr         23423  10000     FREE

translation and overture are almost impossible because I have to find the specific language.

bramiozo
22nd January 2006, 11:15 AM
I have made the bulk puny converter available again at :
http://beginnerguides.com.server6.firstfind.nl/tblox/bulkpuny2.php

zorglub
13th January 2008, 04:26 PM
Bramiozo, any update about the detection of the language of an IDN in PHP ? So far I've found nothing on the web but I see some people have it more or less (ex idnwhois.org).

bramiozo
13th January 2008, 06:03 PM
It's possible to detect the character set from the Unicode tables; I have been able to do that for a long time. But detecting the language from one word is not possible, since several languages may use the same character set: Cyrillic might be Russian, but it might as well be Bulgarian or Ukrainian, etc. Hangul, Katakana and Hiragana are exceptions, of course. I am very sure idnwhois.org is not able to do that either.

zorglub
13th January 2008, 06:38 PM
Thanks ! I see... anyway it's already nice to sort by characterset. Is there a PHP class available to do that ? Hopefully PHP6 will be published soon...

jose
13th January 2008, 07:08 PM
Yes, there is. It's a UTF-8 class that replaces all the non-compatible PHP5 functions with its own.

zorglub
13th January 2008, 07:21 PM
I already have idna_convert, but it doesn't let me detect the character set. If there's another one, can you tell me which one it is?
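
No specific class is named in the thread. As one possible approach that needs no extra class, PCRE's Unicode script properties can report which scripts occur in a UTF-8 string. The sketch below assumes PHP's PCRE was built with Unicode property support (the usual case); the function name and the choice of scripts are hypothetical.

```php
<?php
// Return the names of the listed scripts that occur in a UTF-8 string.
function detect_scripts(string $utf8): array
{
    $scripts = ['Latin', 'Cyrillic', 'Greek', 'Arabic', 'Thai',
                'Han', 'Hiragana', 'Katakana', 'Hangul'];
    $found = [];
    foreach ($scripts as $script) {
        // \p{Name} matches any character belonging to that script.
        if (preg_match('/\p{' . $script . '}/u', $utf8) === 1) {
            $found[] = $script;
        }
    }
    return $found;
}
```

The returned list can serve as a sort key for grouping bulk output by script.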