Hi all,
Does anyone have sample source code, or any idea how I can identify the language of a given String?
For example:
“林悦旻” - Chinese
“ABC” - English
etc.
Thanks!
vaskar
Where are you getting the string from?
Maybe this will help you: jchardet.sourceforge.net
I don't know how it could possibly be accurate, though...
Hi,
The code in this link is for Japanese, and it seems it could be adapted to Chinese as well. As I don't know Chinese, I can't help you with the code.
Java - Chinese Language Processing and Chinese Computing
Since a String is made up of Unicode characters, convert one of the characters in the string to an int value and check which range of Unicode values it falls in. For example, ASCII (English) characters range from 0 to 127, and Latin-1 extends that to 255. To guess the alphabet a char comes from, you need a table that maps ranges of Unicode values to scripts. Keep in mind that a range identifies a script rather than a language: Latin letters, for instance, are shared by English, French and many other languages.
Something like: Basic Latin 0x0000-0x007F, Hiragana 0x3040-0x309F, Katakana 0x30A0-0x30FF, etc., across the values a Java char can hold (0-65535).
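As for where the range table comes from: Java already ships one, so you may not need to build it yourself. Here is a minimal sketch, assuming it is enough to look at the first character of the string. `Character.UnicodeBlock.of` is part of the standard library; the class name `BlockLookup` and the method `blockOf` are made up for illustration.

```java
import java.lang.Character.UnicodeBlock;

class BlockLookup {
    /**
     * Returns the Unicode block of the first code point in s,
     * or null if s is empty or the code point is in no assigned block.
     */
    static UnicodeBlock blockOf(String s) {
        if (s == null || s.isEmpty()) {
            return null;
        }
        // codePointAt also copes with characters stored as surrogate pairs
        int codePoint = s.codePointAt(0);
        return UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        // Escapes are used so the result does not depend on the source file encoding
        System.out.println(blockOf("ABC"));
        System.out.println(blockOf("\u6797\u60A6\u65FB")); // the Han string from the question
        System.out.println(blockOf("\u3072\u3089"));       // two hiragana characters
    }
}
```

This only tells you the block (script) of a character. It cannot by itself separate languages that share a script.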
Can you give me some sample code? Where do I get the Unicode range table? Can you give me a URL?
Thanks!
vaskar
This page may help you.
Where do I get the information about which range of values belongs to which language?
Please help me.
You have to use the Unicode code charts, the block tables published by the Unicode Consortium.
vaskarbasak, this is a remarkably involved subject. I found some work on it that runs to literally millions of lines of text, and there are issues that are not apparent to people who only ever work with ISO Latin-1 text.
Just digging through the available information would require writing specialized Java programs. It would be better if you told us what you are trying to achieve. Java has a remarkable ability to handle a String as a String without the coder having to disentangle the work of the Unicode Consortium.
Have you ever read an RFC?
Hey Vaskarbasak,
I now need exactly the same thing. Please help me; I hope you have solved it by now.
If anybody has any idea, please tell me.
Did you go through all the replies in this thread? There are lots of hints for you. ;)
The answer from Nivedita was useful for my problem, for Japanese. I have the same requirement for Chinese and Korean as well.
As for the character ranges mentioned above, I'm not confident enough to decide the language of a String just from a range of values.
Niveditha's link talks about Unicode. So what you have to do is find the correct range of Unicode values for each language.
Hey Eranga,
Thank you so much, I got it and it's working fine. :)
Nice to hear that. If you can, I think it would be good to briefly explain here how you did it, so that later another member can follow the approach you took to solve the problem.
Yes, Eranga's suggestion was right; at least no one else will have to spend another month finding the same solution. :)
Here's a brief explanation of my problem and solution.
I wanted to recognize Chinese (both traditional and simplified), Japanese and Korean. In code we can't spot these characters as easily as we do English characters/strings, so this is done with the help of the Unicode character set. Each script has its own range of values for its characters. For example,
for Korean the range of values is '\uAC00' to '\uD7A3', which means every precomposed Korean syllable has a value within this range. So if a character falls in this range, we can conclude that it is Korean.
Please note that this range is the Hangul Syllables block. It is not a separate kind of Korean; it is one of several Unicode blocks used for Korean text. Others include Hangul Jamo (U+1100 to U+11FF) and Hangul Compatibility Jamo (U+3130 to U+318F), though in practice most Korean text falls within Hangul Syllables.
Please make sure your Java source file is saved in a Unicode encoding (UTF-8).
More questions? Mail me.
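For later readers, the range check described above could look roughly like the sketch below. It is not the poster's actual code: the class `CjkGuess`, the method names and the returned labels are all hypothetical. It combines the Hangul Syllables range from this post with `Character.UnicodeScript` (available since Java 7), and assumes that any kana means Japanese and any Hangul means Korean. A string made only of Han characters is reported as ambiguous, because those characters are shared between Chinese and Japanese and a range alone cannot tell them apart.

```java
class CjkGuess {
    /** True if c lies in the Hangul Syllables block, U+AC00 to U+D7A3. */
    static boolean isHangulSyllable(char c) {
        return c >= '\uAC00' && c <= '\uD7A3';
    }

    /** Rough guess based only on which scripts occur in s. */
    static String guess(String s) {
        boolean han = false, kana = false, hangul = false;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs correctly
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.HANGUL) {
                hangul = true;
            } else if (script == Character.UnicodeScript.HIRAGANA
                    || script == Character.UnicodeScript.KATAKANA) {
                kana = true;
            } else if (script == Character.UnicodeScript.HAN) {
                han = true;
            }
        }
        if (hangul) return "Korean";
        if (kana) return "Japanese";
        if (han) return "Chinese or Japanese";
        return "other";
    }

    public static void main(String[] args) {
        // Escapes keep the example independent of the source file encoding
        System.out.println(guess("\uD55C\uAD6D\uC5B4")); // three Hangul syllables
        System.out.println(guess("\u6797\u60A6\u65FB")); // the Han string from the question
        System.out.println(guess("ABC"));
    }
}
```

Telling simplified from traditional Chinese, or Chinese from kanji-only Japanese, needs more than a range check, so treat the result as a hint rather than a definite answer.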
That's fine. :) Now anyone who refers to this thread can get a brief idea of what he/she has to do.