  1. #1
    skiforfun (Member)

    Java UTF-16 Encoding

    I'm doing some research work on Java and Unicode, and I need the algorithm that encodes a Unicode code point into bytes (how EXACTLY Java implements it). I found that this function is in the Character.java library source, but unfortunately it only shows the following cast to get the char byte(s) from a code point value:

    Java Code:
    public static char[] toChars(int codePoint) {
    ....
        return new char[] { (char) codePoint };
    }
    I tried to find elsewhere how this conversion is done, but no luck so far.

    Does anyone know where I can find how the conversion from a code point value to bytes is actually done?

    Thanks in advance for any help provided :)

  2. #2
    JosAH (Moderator)

    Strange, because this is what my copy of the source code says:

    Java Code:
        public static char[] toChars(int codePoint) {
            if (codePoint < 0 || codePoint > MAX_CODE_POINT) {
                throw new IllegalArgumentException();
            }
            if (codePoint < MIN_SUPPLEMENTARY_CODE_POINT) {
                return new char[] { (char) codePoint };
            }
            char[] result = new char[2];
            toSurrogates(codePoint, result, 0);
            return result;
        }
    Pay special attention to the toSurrogates( ... ) method.
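
    For reference, here is a minimal sketch of the arithmetic that the UTF-16 definition requires for the supplementary branch. It is not copied from the JDK source: the class SurrogateSketch and the method toSurrogatePair are hypothetical names made up for illustration, and the result is only compared against Character.toChars as a sanity check.

```java
import java.util.Arrays;

class SurrogateSketch {
    // Hypothetical helper (not the JDK source): splits a supplementary
    // code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair.
    static char[] toSurrogatePair(int codePoint) {
        int offset = codePoint - 0x10000;                // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >>> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        char[] mine = toSurrogatePair(cp);
        char[] jdk = Character.toChars(cp);
        System.out.printf("%04X %04X%n", (int) mine[0], (int) mine[1]); // D83D DE00
        System.out.println(Arrays.equals(mine, jdk));                   // true
    }
}
```

    The sketch assumes its argument is already known to be a supplementary code point, which is what the range checks in toChars guarantee before toSurrogates is called.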

    kind regards,

    Jos
    Last edited by JosAH; 05-03-2011 at 09:04 PM.
    cenosillicaphobia: the fear for an empty beer glass

  3. #3
    skiforfun (Member)

    The problem remains. Your code in fact still returns:
    Java Code:
    return new char[] { (char) codePoint };
    This doesn't explain how that particular conversion is done. The surrogate pair distinction is only there to return 2 chars instead of 1, but the conversion is done in the same way. What I need to know is how it casts an integer value (codePoint) to a char (bytes).

  4. #4
    JosAH (Moderator)

    Converting an int B0B1B2B3, where B3 is the least significant byte, to a char simply keeps B2B3; the two most significant bytes, B0 and B1, are chopped off.
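
    A small sketch of that narrowing cast, in case it helps to see it run. The class NarrowingCastSketch and the helper lowSixteenBits are hypothetical names used only for this illustration; the helper spells out by hand what the cast does.

```java
class NarrowingCastSketch {
    // Hypothetical helper: the same result as (char) value, written out
    // explicitly as "keep only the low 16 bits".
    static char lowSixteenBits(int value) {
        return (char) (value & 0xFFFF);
    }

    public static void main(String[] args) {
        int bmp = 0x20AC;       // EURO SIGN, fits in 16 bits
        int big = 0x0001F600;   // supplementary code point, does not fit
        System.out.printf("%04X%n", (int) (char) bmp);          // 20AC
        System.out.printf("%04X%n", (int) (char) big);          // F600
        System.out.println((char) big == lowSixteenBits(big));  // true
    }
}
```

    The second value shows why toChars cannot use the plain cast above U+FFFF: the upper bits are silently lost.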

    kind regards,

    Jos

  5. #5
    skiforfun (Member)

    You are right. I was forgetting that characters in the BMP (Basic Multilingual Plane) are encoded in UTF-16 simply as their numerical value, so there is nothing to encode unless a surrogate pair is needed. The surrogate encoding I do understand.
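
    To close the loop on the original "bytes" question, here is a rough sketch that prints the big-endian UTF-16 bytes for one BMP code point and one supplementary code point. The class Utf16BytesSketch and its helpers hex and utf16beHex are hypothetical names for this example; UTF-16BE is picked only so that no byte order mark is written.

```java
import java.nio.charset.StandardCharsets;

class Utf16BytesSketch {
    // Formats a byte array as upper-case hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Hypothetical helper: code point -> chars -> UTF-16BE bytes -> hex.
    static String utf16beHex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        System.out.println(utf16beHex(0x20AC));   // 20AC (one char, two bytes)
        System.out.println(utf16beHex(0x1F600));  // D83DDE00 (two chars, four bytes)
    }
}
```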

    Thanks for your time.
