60,000 Chinese characters!


bwhhisc
27th February 2006, 01:56 AM
Interesting! I started doing a bit of reading on the Chinese language and came across this. It points up the challenge of finding the right word and term for Chinese IDNs.

QUOTE: Even to be able to read a novel (in Chinese), it is absolutely necessary to learn at least 3,000 symbols. People with college degrees are judged to have mastered about 6,000-7,000 of them, but some large dictionaries can contain as many as 60,000 characters.

Each word is set and is basically written the exact same way as it was written 2,000 years ago.

SOURCE: http://www.logoi.com/notes/symbols_alphabet.html

blastfromthepast
27th February 2006, 02:02 AM
One a day... one a day...

bwhhisc
27th February 2006, 02:14 AM
One a day... one a day...

Yes...it does leave a lot to work through...continuing to quote from the article:

"In Chinese writing, however, there are no letters and there is no alphabet. The writing system consists of a large number of symbols used to directly represent words regardless of their value. Although there is some relation between the structure of each symbol and its pronunciation, the symbols cannot be broken down into smaller components to construct a new word".
I guess when all the other IDNs are mopped up in other languages, there will still be challenges here!

TYPING CHINESE ON THE COMPUTER- QUOTE: "The Chinese actually do have an alphabet which is known to the educated people in China who use it for inputting Chinese characters on the computer. The alphabet is the Latin alphabet and it cannot be called "Chinese." It is called pinyin, which simply means "spelling out the sounds" in Chinese. It is not an alphabet but a system of romanization. What this means is that any Chinese character can be written down with pinyin if needed. It provides, for example, a convenient method for writing Chinese on the computer. You type in the pinyin and then you get a list of characters that correspond to that sound".
END QUOTE-
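
To make the "type the pinyin, get a list of characters" idea concrete, here is a minimal Python sketch. The table is a tiny hand-made sample with tone marks dropped, not a real IME dictionary, and the names PINYIN_TABLE and candidates are made up for this illustration.

```python
# Toy pinyin lookup: a hand-made sample table, NOT a real IME dictionary.
# Tone marks are dropped, as when typing plain letters on a keyboard.
PINYIN_TABLE = {
    "fang": ["房", "方", "放", "防"],
    "wu": ["屋", "五", "无"],
    "zi": ["子", "字", "自"],
}


def candidates(pinyin):
    """Return the characters listed for a pinyin syllable (empty list if unknown)."""
    return PINYIN_TABLE.get(pinyin.lower().strip(), [])


if __name__ == "__main__":
    for syllable in ("fang", "wu", "xyz"):
        print(syllable, "->", candidates(syllable))
```

A real input method ranks the candidates by frequency and context; this sketch only shows that one typed syllable maps to several characters.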

touchring
27th February 2006, 02:21 AM
Yes, but many Chinese words are formed out of pairs of characters. This applies to Kanji in some cases too, so do not go and register all 3,000 characters!

Some examples of pairs:

房 - character for house/room (no ovt)

房屋 - house (ovt 53)
房子 - house (ovt 47)
房间 - room (no ovt)
房地产 - property (ovt 190)
房租 - house rent (no ovt)

IDNCowboy
27th February 2006, 02:32 AM
Yes, but many Chinese words are formed out of pairs of characters. This applies to Kanji in some cases too, so do not go and register all 3,000 characters!

Some examples of pairs:

房 - character for house/room (no ovt)

房屋 - house (ovt 53)
房子 - house (ovt 47)
房间 - room (no ovt)
房地产 - property (ovt 190)
房租 - house rent (no ovt)

It's like playing the memory game :P

touchring
27th February 2006, 02:40 AM
It's like playing the memory game :P

With the PC and pinyin, it's actually simpler to learn how to read and speak Chinese than English (for a baby at least). Writing is a different issue though. :-)

Giant
27th February 2006, 03:03 AM
Yes, but many Chinese words are formed out of pairs of characters. This applies to Kanji in some cases too, so do not go and register all 3,000 characters!

Some examples of pairs:

房 - character for house/room (no ovt)

房屋 - house (ovt 53)
房子 - house (ovt 47)
房间 - room (no ovt)
房地产 - property (ovt 190)
房租 - house rent (no ovt)

Let's say an average Chinese reader understands 5,000 characters, and Chinese words are 1-char, 2-char and 3-char terms.

Suppose each one of these 5,000 characters can match with an average of 10 other characters to make 10 new 2-char terms, and each of these 10 new terms can match with an average of 2 other characters to make 2 more new 3-char terms. Then:

5000 1-char words
5000 x 10 = 50k 2-char words
5000 x 10 x 2 = 100k 3-char words
---------------------------------------------
Total: 155k words.

bwhhisc
27th February 2006, 03:04 AM
With the PC and pinyin, it's actually simpler to learn how to read and speak Chinese than English (for a baby at least). Writing is a different issue though. :-)

Is the quote about needing to know at least "3000 symbols in order to read a novel" a reality, then?

touchring
27th February 2006, 03:24 AM
It goes this way: once you understand 1,000+ characters, you can guess the meaning of all sorts of combinations.

That's how I guess Japanese words even though I have not learnt Kanji, and some words differ from Chinese.

Giant
27th February 2006, 03:24 AM
Let's say an average Chinese reader understands 5,000 characters, and Chinese words are 1-char, 2-char and 3-char terms.

Suppose each one of these 5,000 characters can match with an average of 10 other characters to make 10 new 2-char terms, and each of these 10 new terms can match with an average of 2 other characters to make 2 more new 3-char terms. Then:

5000 1-char words
5000 x 10 = 50k 2-char words
5000 x 10 x 2 = 100k 3-char words
---------------------------------------------
Total: 155k words.

I must have miscalculated above. I think this one makes more sense.

5000 1-char words
5000 x 10 = 50k 2-char words
5000 x 2 = 10k 3-char words (assuming only 10% of the 50k 2-char words, i.e. 5,000 of them, can each match with 2 other chars)
---------------------------------------------
Total: 65k words (roughly the vocabulary a college student must master in the US)
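
For anyone who wants to play with the assumptions, here is a minimal Python sketch of the revised estimate above. All four inputs are the guesses from this post, not measured figures, and the constant and function names are made up for illustration.

```python
# Revised vocabulary estimate; every input is an assumption from the post,
# not measured data.
CHARS = 5000             # characters an average reader is assumed to know
PAIRS_PER_CHAR = 10      # assumed 2-char words formed per character
EXTENDABLE_PERCENT = 10  # assumed % of 2-char words that extend to 3 chars
EXTENSIONS = 2           # assumed 3-char words per extendable 2-char word


def estimate_vocabulary():
    """Return (one_char, two_char, three_char, total) word counts."""
    one_char = CHARS
    two_char = CHARS * PAIRS_PER_CHAR
    extendable = two_char * EXTENDABLE_PERCENT // 100
    three_char = extendable * EXTENSIONS
    total = one_char + two_char + three_char
    return one_char, two_char, three_char, total


if __name__ == "__main__":
    print(estimate_vocabulary())  # (5000, 50000, 10000, 65000)
```

Changing EXTENDABLE_PERCENT to 100 reproduces the first, larger estimate of 155k words.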

Is the quote about needing to know at least "3000 symbols in order to read a novel" a reality, then?

A good question! A Chinese language teacher probably has the knowledge to answer this question.

blastfromthepast
27th February 2006, 04:17 AM
Each character is made up of a basic set of characters, called radicals. There are only 214 of them, give or take a few variations. Once you memorize them and learn what they mean, memorizing the other characters that are simply combinations of the radicals in various ways is easy.

Reading through http://en.wikipedia.org/wiki/Radical_%28Chinese_character%29 should give you a good understanding of how this works.

bwhhisc
27th February 2006, 12:03 PM
Each character is made up of a basic set of characters, called radicals. There are only 214 of them, give or take a few variations. Once you memorize them and learn what they mean, memorizing the other characters that are simply combinations of the radicals in various ways is easy.

"Easy"...maybe if you have learned to speak Chinese already! So how do these 214 characters relate to IDNing? I assume that you need to make sure all of your regs include only these 214 symbols in the makeup of words? I have been finding Chinese symbols for words that are different from those given by the online translators. An example is the Chinese symbols for the word lyric (I have found 3 different ways of saying this, all seemingly related to the same meaning, "lyric of song"). Thanks for explaining.

Rubber Duck
27th February 2006, 12:13 PM
TYPING CHINESE ON THE COMPUTER- QUOTE: "The Chinese actually do have an alphabet which is known to the educated people in China who use it for inputting Chinese characters on the computer. The alphabet is the Latin alphabet and it cannot be called "Chinese." It is called pinyin, which simply means "spelling out the sounds" in Chinese. It is not an alphabet but a system of romanization. What this means is that any Chinese character can be written down with pinyin if needed. It provides, for example, a convenient method for writing Chinese on the computer. You type in the pinyin and then you get a list of characters that correspond to that sound".
END QUOTE-

Because all Chinese is written the same but often spoken differently, Pinyin should refer to a specific Chinese language or dialect. Cantonese speakers won't necessarily be able to understand Mandarin Pinyin.

"Easy"...maybe if you have learned to speak Chinese already! So how do these 214 characters relate to IDNing? I assume that you need to make sure all of your regs include only these 214 symbols in the makeup of words? I have been finding Chinese symbols for words that are different from those given by the online translators. An example is the Chinese symbols for the word lyric (I have found 3 different ways of saying this, all seemingly related to the same meaning, "lyric of song"). Thanks for explaining.

From a domainer's perspective it is important to understand that the characters are grouped into systems that help them to be generated rapidly and conveniently. The Radicals system is one such system, but it is not actually used by the Wubi keyboard, which uses another classification.

It is nice to be able to recognise a few of the simpler characters, but unless you are going to spend several years learning the language, I wouldn't bother going into it too deeply. You'll soon start to become familiar with some of your acquisitions.

bwhhisc
27th February 2006, 01:30 PM
It is nice to be able to recognise a few of the simpler characters, but unless you are going to spend several years learning the language, I wouldn't bother going into it too deeply. You'll soon start to become familiar with some of your acquisitions.

Not at all trying to get into this too deeply, I assure you! Just trying to get clear:
QUOTE: Originally Posted by blastfromthepast
Each character is made up of a basic set of characters, called radicals. There are only 214 of them, give or take a few variations. END QUOTE

What I am trying to get clear on is whether only 214 characters make up all Chinese IDNs. I am finding Chinese characters in books that I cannot find anywhere to "cut and paste" to see if they can be registered. If I know you have to stay within the bounds of the 214 characters, that is good information.

touchring
27th February 2006, 01:55 PM
Actually, it is very simple, just register keywords that give the highest US OVT.

Found an article on Chinese IMEs - http://www.microsoft.com/globaldev/handson/user/IME_Paper.mspx#EDAA

Rubber Duck
27th February 2006, 02:15 PM
Not at all trying to get into this too deeply, I assure you! Just trying to get clear:
QUOTE: Originally Posted by blastfromthepast
Each character is made up of a basic set of characters, called radicals. There are only 214 of them, give or take a few variations. END QUOTE

What I am trying to get clear on is whether only 214 characters make up all Chinese IDNs. I am finding Chinese characters in books that I cannot find anywhere to "cut and paste" to see if they can be registered. If I know you have to stay within the bounds of the 214 characters, that is good information.

No, there are 50,000, but if you think of radicals as Level 1 menu choices, they enable the characters to be categorised and located. Even the Chinese need to be able to look up words in a dictionary. The radicals enable them to order things, a bit like we use ABC to define location in a dictionary.

Yep, finding the Unicode in the first place has always been the problem. Does the book give you the Pinyin? You can often find the Unicode from the Pinyin.
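
Once you do have the characters, getting from them to code points and to an ASCII form of the name can be sketched in Python as below. This assumes Python 3 and its built-in "idna" codec, which follows the older IDNA 2003 rules; a registry may apply its own rules, so treat the result as a starting point only. The function names are made up for this example.

```python
def code_points(text):
    """Return the Unicode code point of each character, formatted like 'U+4E2D'."""
    return [f"U+{ord(ch):04X}" for ch in text]


def to_ascii_domain(name):
    """Convert a Unicode domain name to its ASCII (Punycode, 'xn--') form.

    Relies on Python's built-in 'idna' codec (IDNA 2003); registries may
    use different or stricter rules.
    """
    return name.encode("idna").decode("ascii")


if __name__ == "__main__":
    word = "房屋"
    print(code_points(word))
    print(to_ascii_domain(word + ".com"))
```

The ASCII form is what actually gets stored in DNS, so it is the string to paste into a registrar's availability check if the Unicode form is not accepted.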

blastfromthepast
27th February 2006, 03:19 PM
Each character itself is made up of combinations of smaller elements, known as radicals, of which there are 214.

Here is a simple example of what I mean by RADICAL.


木
This is a radical character. It means TREE. And, it also looks like a TREE.
Radical characters are used to create other characters.


林
This is the character for WOODS. It is made up of TWO TREES put together.


森
This is the character for FOREST. It is made up of THREE TREES put together.

林森
This is the JAPANESE CHARACTER COMBINATION made up of WOODS and FOREST characters. It means DEEPLY FORESTED.

This is why I said, once you know the 214 radicals, you can memorize the rest, by remembering which radicals make up the more complicated characters. So, you can describe a character in your head this way:

帚 BROOMSTICK at the top and 女 WOMAN at the bottom means WIFE 妻. Now you've remembered how to read WIFE. Note: If you find this offensive, well, that's not my fault, it's the way WIFE is written in Chinese. In Japanese, the word WIFE is written as 妻室, adding 室 ROOM at the end!

When I said it was easy, I meant it is easier than you think to learn new characters, it's not just thousands of meaningless symbols that are all different and have no logic to them that you have to memorize.
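
The tree example can also be written down as a small lookup, which is roughly how a radical index in a dictionary works. This Python sketch uses a hand-made two-entry table, not real dictionary data, and the names COMPONENTS, describe and chars_with_component are made up for illustration.

```python
# Hand-made sample of component breakdowns; a real dictionary has thousands.
COMPONENTS = {
    "林": ["木", "木"],        # woods: two trees
    "森": ["木", "木", "木"],  # forest: three trees
}


def describe(char):
    """Spell out a character as the components it is built from."""
    parts = COMPONENTS.get(char)
    if parts is None:
        return f"{char} (no breakdown in this sample table)"
    return f"{char} = " + " + ".join(parts)


def chars_with_component(component):
    """List sample characters containing the component, like a radical index."""
    return sorted(c for c, parts in COMPONENTS.items() if component in parts)


if __name__ == "__main__":
    print(describe("林"))
    print(describe("森"))
    print(chars_with_component("木"))
```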

bwhhisc
28th February 2006, 11:51 PM
Each character itself is made up of combinations of smaller elements, known as radicals, of which there are 214. When I said it was easy, I meant it is easier than you think to learn new characters, it's not just thousands of meaningless symbols that are all different and have no logic to them that you have to memorize.

Thanks, Blast, for taking the time to give this example. That really was a great explanation. Probably where the saying "a picture is worth a thousand words" came from. Anyway, we are all still learning here. Regards, Bill