Problem with encoding Russian text between UTF-8 and Unicode
Hello!
Not long ago I tried to encode and decode Russian text from Unicode to UTF-8 and back, and I discovered that Java doesn't like the Russian letter 'И' ('\u0418'). Here is my code and its output:
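class Basic
{
    public static void main(String[] args) throws Exception
    {
        // Build a string of the Russian letters U+0410..U+044F.
        String s = "";
        for (char ch = 0x0410; ch <= 0x044F; ch++)
            s += ch;
        System.out.println(s);
        // Encode as UTF-8, then decode with the platform default charset.
        s = new String(s.getBytes("UTF-8"));
        System.out.println(s);
        // Encode with the platform default charset, then decode as UTF-8.
        s = new String(s.getBytes(), "UTF-8");
        System.out.println(s);
    }
}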
What I see in my console:
АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя
АБВГДЕЖЗР?ЙКЛМНОПР*СТУФХЦЧШЩЪЫЬР*ЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя
АБВГДЕЖЗ??ЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя
So the only character distorted by this encoding/decoding round trip is 'И' ('\u0418'), but it bothers me a lot. What can I do to avoid this problem?
Re: Problem with encoding Russian text between UTF-8 and Unicode
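The default encoding on your system is not UTF-8 (the second line of output tells you so). Both new String(byte[]) and getBytes() without a charset argument use that platform default, so your code encodes with one charset and decodes with another. You should set both the encoding and the decoding to UTF-8, as in: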
Code:
s = new String(s.getBytes("UTF-8"), "UTF-8");
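A slightly fuller sketch of the same idea, assuming Java 7 or later so that the StandardCharsets constants are available (they avoid the checked UnsupportedEncodingException that the string form "UTF-8" can throw). The class name Utf8RoundTrip and the helper roundTrip are hypothetical names chosen for illustration only:

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Encodes the text to UTF-8 bytes and decodes those bytes with the same charset.
    public static String roundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Same range of Russian letters as in the question: U+0410..U+044F.
        StringBuilder sb = new StringBuilder();
        for (char ch = 0x0410; ch <= 0x044F; ch++) {
            sb.append(ch);
        }
        String original = sb.toString();
        String restored = roundTrip(original);
        // Compare in memory, so the result does not depend on the console encoding.
        System.out.println(original.equals(restored)); // prints "true"
    }
}
```

Because the same charset is named on both sides, the platform default never takes part in the conversion. What the console shows for the letters themselves is a separate matter that depends on the console's own encoding.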
kind regards,
Jos
Re: Problem with encoding Russian text between UTF-8 and Unicode
Thank you very much. I see that I should always specify the charset explicitly rather than rely on the system default.
Re: Problem with encoding Russian text between UTF-8 and Unicode
Quote (originally posted by Dyukon):
Thank you very much. I see that I should always specify the charset explicitly rather than rely on the system default.
Yep, as long as you realize that the encoding and the decoding parts are equally important, that entire Unicode hoopla is easy.
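To see why a mismatch can destroy one particular letter, here is a rough sketch that repeats the two conversions from the question with the default charset made explicit. It rests on an assumption the thread never confirms: that the platform default was windows-1251, and that this charset is available in the running JVM. The class name LostLetterDemo and the helper throughAssumedDefault are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LostLetterDemo {
    // Assumption: the asker's platform default charset was windows-1251.
    static final Charset ASSUMED_DEFAULT = Charset.forName("windows-1251");

    // Repeats steps 2 and 3 of the question, naming the default charset explicitly.
    public static String throughAssumedDefault(String text) {
        // Step 2: UTF-8 bytes decoded with the (assumed) default charset.
        String misread = new String(text.getBytes(StandardCharsets.UTF_8), ASSUMED_DEFAULT);
        // Step 3: encoded with the (assumed) default charset, decoded as UTF-8.
        return new String(misread.getBytes(ASSUMED_DEFAULT), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Report every letter in U+0410..U+044F that does not come back unchanged.
        for (char ch = 0x0410; ch <= 0x044F; ch++) {
            String letter = String.valueOf(ch);
            if (!letter.equals(throughAssumedDefault(letter))) {
                System.out.printf("U+%04X does not survive%n", (int) ch);
            }
        }
    }
}
```

Under that assumption, 'И' is the likely casualty: its UTF-8 form is the bytes D0 98, and 0x98 is the one byte value that windows-1251 leaves unassigned, so the wrong decoder cannot keep it and the later re-encoding cannot restore it. The other letters pass through the wrong charset as garbage but map back to their original bytes.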
kind regards,
Jos