
Thread: Read unicode code points from file to real characters

  1. #1
    lordelf2004


    Hi all,

    I need to generate n-grams from a text file that contains Unicode escape sequences (e.g. \u042D, \u0441, \u043A, ...). To do that, I first have to convert these escapes into real characters, and that is where I'm stuck.

    My text file ("I:/unicode.txt") contains a string like "8 \u042D\u0441\u043A\u0435-\u041D\u043E\u0442\u0440-\u0414\u0430\u043CAAAA abcde". The problem is:
    + When I read the file line by line and print the string to my console, the output shows the same escape sequences as above.
    + But when I paste the content of that file into a String literal in my Java source, it prints real characters like "Эске..."

    Can anyone tell me why the escape sequences read from my text file are not printed as real characters?

    Any suggestions are appreciated!
    Thanks in advance.


    Java Code:
    public static void main(String[] args) throws IOException {
        // Escapes inside a literal are translated by the compiler.
        String str1 = "8 \u042D\u0441\u043A\u0435-\u041D\u043E\u0442\u0440-\u0414\u0430\u043CAAA abcde";
        // Escapes read from the file arrive as plain text.
        String str2 = getStringFromFile();
        System.out.println(str1); // prints real characters: 8 Эске-Нотр-ДамAAA abcde
        System.out.println(str2); // prints the escape sequences unchanged, exactly as stored in the file
    }

    My function to read the text file looks like this:

    Java Code:
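    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public static String getStringFromFile() throws IOException {
        StringBuilder content = new StringBuilder();
        // Name the charset explicitly instead of relying on the platform default
        // (UTF-8 is an assumption about how the file was saved).
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream("I:/unicode.txt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                content.append(line); // readLine() drops the line terminator
            }
        } // try-with-resources closes the reader and the underlying stream
        return content.toString();
    }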

  2. #2
    JosAH


    If you put those escape sequences in a String literal in your Java source code, it's the compiler that does the translation for you. If you just read lines from a file, nothing translates them, so you have to do it yourself. Here's a quick snippet that translates a line (much like the compiler does):

    Java Code:
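    	// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;

    	// A backslash, 'u', then exactly four hex digits (captured as group 1).
    	private static final Pattern ESCAPE= Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    	public static String process(String line) {

    		Matcher m= ESCAPE.matcher(line);
    		StringBuffer sb= new StringBuffer();
    		while (m.find()) {
    			char c= (char)Integer.parseInt(m.group(1), 16);
    			// quoteReplacement keeps a decoded '\' or '$' from being treated specially
    			m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
    		}
    		m.appendTail(sb); // copy whatever follows the last escape
    		return sb.toString();
    	}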
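
    To see the difference between the two cases side by side, here is a small self-contained sketch. The class name UnicodeEscapeDemo and the helper decode are made up for illustration, and it assumes the file only uses the plain four-hex-digit form of the escape:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeEscapeDemo {

    // A backslash, 'u', then exactly four hex digits (captured as group 1).
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Hypothetical helper: replaces each escape with the char it names.
    public static String decode(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String compiled = "\u042D";  // the compiler turns this into a single char
        String asInFile = "\\u042D"; // six chars (backslash, u, 0, 4, 2, D), as a reader returns them

        System.out.println(compiled.length());                 // 1
        System.out.println(asInFile.length());                 // 6
        System.out.println(decode(asInFile).equals(compiled)); // true
    }
}
```

    Two escapes that form a surrogate pair decode to two chars that together make one supplementary character, so that case should work too. Whether the real characters then show up correctly on your console depends on the console's encoding, which is a separate matter.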
    kind regards,

    Jos
    Last edited by JosAH; 07-30-2011 at 01:08 PM. Reason: simplified the code a bit
