  1. #1
    dazz is offline Member
    Join Date
    Apr 2017
    Posts
    27
    Rep Power
    0

    Default Howto Detect Chinese characters in a line of text???

    Hello
    I am new to Java and I want to use the Google translate API to translate HTML files with Java documentation written in Mandarin.
    About 20% of lines contain some Mandarin. Less than 5% of words are Mandarin.
    A typical sample of the file is shown below:

    Java Code:
    ControllerConfigBxFile</a>&nbsp;ccf)
                        throws <a href="../../../onbon/bx06/Bx6GException.html" title="class in onbon.bx06">Bx6GException</a></pre>
    <div class="block">同步将屏幕参数写入控制器。</div>
    <dl><dt><span class="strong">Parameters:</span></dt><dd><code>ccf</code> - 屏幕参数。</dd>
    <dt><span class="strong">Returns:</span></dt><dd>写入结果。</dd>
    <dt><span class="strong">Throws:</span></dt>
    <dd><code><a href="../../../onbon/bx06/Bx6GException.html" title="class in onbon.bx06">Bx6GException</a></code> - 写入发生错误。</dd></dl>
    </li>
    </ul>
    The Google service is charged on words translated. I'd like to reduce the cost by only translating lines with Mandarin characters.
    The ideal would be to identify and translate only the Mandarin characters, but I don't need to get that sophisticated.

    Google provide an example of how to use the API.

    Java Code:
    // Imports the Google Cloud client library
    import com.google.cloud.translate.Translate;
    import com.google.cloud.translate.Translate.TranslateOption;
    import com.google.cloud.translate.TranslateOptions;
    import com.google.cloud.translate.Translation;

    public class QuickstartSample {
      public static void main(String... args) throws Exception {
        // Instantiates a client
        Translate translate = TranslateOptions.getDefaultInstance().getService();

        // The text to translate
        String text = "Hello, world!";

        // Translates some text into Russian
        Translation translation =
            translate.translate(
                text,
                TranslateOption.sourceLanguage("en"),
                TranslateOption.targetLanguage("ru"));

        System.out.printf("Text: %s%n", text);
        System.out.printf("Translation: %s%n", translation.getTranslatedText());
      }
    }
    It is fairly simple but I need it to process files. So I wrote the following code.
    This code is just for my use, so it doesn't need any fancy stuff.
    The code that is missing is the test for Mandarin.

    Java Code:
    // Imports the Google Cloud client library
    import com.google.cloud.translate.Translate;
    import com.google.cloud.translate.Translate.TranslateOption;
    import com.google.cloud.translate.TranslateOptions;
    import com.google.cloud.translate.Translation;
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.PrintWriter;

    public class TranslateFiles {
      public static void main(String[] args) throws Exception {
        String fpI = "D:/temp/";  // source directory
        String fpO = "D:/temp/translated/";  // destination directory
        // Instantiates a client
        Translate translate = TranslateOptions.getDefaultInstance().getService();

        for (String file : args) {  // for each of the files entered
          // Output uses the same filename as the input and is opened in append mode.
          // try-with-resources closes (and so flushes) both streams, even on error.
          try (BufferedReader br = new BufferedReader(new FileReader(fpI + file));
               PrintWriter pw = new PrintWriter(new BufferedWriter(new FileWriter(fpO + file, true)))) {
            for (String line; (line = br.readLine()) != null; ) {
              // line holds the string to translate.
              // TODO: if there is a Mandarin character in the line, translate the whole line;
              // otherwise just write the line to the output file and continue.
              Translation translation =
                  translate.translate(
                      line,
                      TranslateOption.sourceLanguage("zh-CN"),
                      TranslateOption.targetLanguage("en"));
              pw.println(translation.getTranslatedText());
            }
          }
        }
      }
    }
    How do I test a line for any Mandarin characters??

  2. #2
    SurfMan is offline Godlike
    Join Date
    Nov 2012
    Location
    The Netherlands
    Posts
    1,989
    Rep Power
    8

    Default Re: Howto Detect Chinese characters in a line of text???

    You can iterate over each character in the String, and if its code point lies in the range U+4E00 to U+9FFF, it is a CJK ideograph. I cooked up an example to demonstrate:
    Java Code:
    public class CJK {
        private static final int RANGE_START = 0x4e00;
        private static final int RANGE_END = 0x9FFF;
    
        public static void main(String[] args) {
            String c = "<div class=\"block\">同步将屏幕参数写入控制器。</div>";
            for (int i = 0; i < c.length(); ) {
                int codePointAt = Character.codePointAt(c, i);
                i += Character.charCount(codePointAt); // step over both halves of a surrogate pair
                System.out.printf("Got character: %s%n", new String(Character.toChars(codePointAt)));
                if (codePointAt >= RANGE_START && codePointAt <= RANGE_END) {
                    System.out.printf("This looks CJK: \\u%s%n", Integer.toHexString(codePointAt));
                }
            }
        }
    }
    If you check the Unicode code points (for example 540C) on this page: https://en.wikipedia.org/wiki/List_o...s,_part_1_of_4
    you can see the conversion is spot on. (Find row 540x, then go right 13 places to the column for hexadecimal 'C'.)

    Output:
    Java Code:
    Got character: <
    Got character: d
    Got character: i
    Got character: v
    Got character:  
    Got character: c
    Got character: l
    Got character: a
    Got character: s
    Got character: s
    Got character: =
    Got character: "
    Got character: b
    Got character: l
    Got character: o
    Got character: c
    Got character: k
    Got character: "
    Got character: >
    Got character: 同
    This looks CJK: \u540c
    Got character: 步
    This looks CJK: \u6b65
    Got character: 将
    This looks CJK: \u5c06
    Got character: 屏
    This looks CJK: \u5c4f
    Got character: 幕
    This looks CJK: \u5e55
    Got character: 参
    This looks CJK: \u53c2
    Got character: 数
    This looks CJK: \u6570
    Got character: 写
    This looks CJK: \u5199
    Got character: 入
    This looks CJK: \u5165
    Got character: 控
    This looks CJK: \u63a7
    Got character: 制
    This looks CJK: \u5236
    Got character: 器
    This looks CJK: \u5668
    Got character: 。
    Got character: <
    Got character: /
    Got character: d
    Got character: i
    Got character: v
    Got character: >
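
    For the original question (testing a whole line rather than printing each character), the same range check could be wrapped in a helper. This is only a minimal sketch: the class name `CjkLineCheck` and the method names `containsCjk` and `containsHan` are made up for illustration. The second method is an alternative that asks the JDK for the script of each code point instead of hard-coding a range, which should also catch Han characters outside U+4E00 to U+9FFF.

```java
public class CjkLineCheck {
    // Same range as the example above: the CJK Unified Ideographs block.
    private static final int RANGE_START = 0x4E00;
    private static final int RANGE_END = 0x9FFF;

    /** Returns true if any code point of the line lies in U+4E00..U+9FFF. */
    public static boolean containsCjk(String line) {
        return line.codePoints().anyMatch(cp -> cp >= RANGE_START && cp <= RANGE_END);
    }

    /** Alternative (assumption: Han script is what you want to detect). */
    public static boolean containsHan(String line) {
        return line.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    public static void main(String[] args) {
        String chinese = "<div class=\"block\">同步将屏幕参数写入控制器。</div>";
        String plain = "<dt><span class=\"strong\">Throws:</span></dt>";
        System.out.println(containsCjk(chinese)); // true
        System.out.println(containsCjk(plain));   // false
        System.out.println(containsHan(chinese)); // true
    }
}
```

    Note that punctuation such as 。 (U+3002) falls outside the range, as the output above shows, so a line containing only CJK punctuation would not be flagged by either method.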
    Last edited by SurfMan; 04-11-2017 at 02:18 PM.
    "It's not fixed until you stop calling the problem weird and you understand what was wrong." - gimbal2 2013

  3. #3
    dazz

    Default Re: Howto Detect Chinese characters in a line of text???

    Hello
    OK thanks. I have adapted your code and incorporated it into my code.
    I now have another problem. The last 40 or so lines from the input file are not saved in the output file.
    I have opened a new thread.
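
    The adapted code is not shown in this thread. As a rough sketch of how the check might be wired into the per-line loop, the example below translates only the lines that contain a CJK ideograph and copies every other line unchanged. The class name `SelectiveTranslator`, the method `translateLines`, and the stand-in translator are all hypothetical; the stand-in is there so the sketch runs without Google credentials.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class SelectiveTranslator {
    /** Returns true if any code point of the line lies in U+4E00..U+9FFF. */
    public static boolean containsCjk(String line) {
        return line.codePoints().anyMatch(cp -> cp >= 0x4E00 && cp <= 0x9FFF);
    }

    /** Applies the translator only to lines with CJK; other lines pass through as-is. */
    public static List<String> translateLines(List<String> lines, UnaryOperator<String> translator) {
        List<String> out = new ArrayList<>(lines.size());
        for (String line : lines) {
            out.add(containsCjk(line) ? translator.apply(line) : line);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the real API call; it only marks which lines would be sent.
        UnaryOperator<String> fakeTranslator = line -> "[translated] " + line;

        List<String> input = Arrays.asList(
                "<dt><span class=\"strong\">Returns:</span></dt><dd>写入结果。</dd>",
                "</li>");
        for (String line : translateLines(input, fakeTranslator)) {
            System.out.println(line);
        }
    }
}
```

    With the real client, the translator argument would presumably be a lambda around the `translate.translate(...)` call from the first post, returning `getTranslatedText()`.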

  4. #4
    SurfMan

    Default Re: Howto Detect Chinese characters in a line of text???

    Quote: Originally posted by dazz
    Hello
    OK thanks. I have adapted your code and incorporated it into my code.
    I now have another problem. The last 40 or so lines from the input file are not saved in the output file.
    I have opened a new thread.
    The new thread is not there, link?

    Make sure to flush the output streams when you're done, and check for an unwritten buffer when a loop ends; that sort of bug.
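
    One common cause of missing trailing lines is closing the underlying FileWriter directly while text is still sitting in a BufferedWriter wrapped around it. A minimal sketch of one way to avoid that is to let try-with-resources close the outermost writer, which flushes the buffer first. The class name `FlushOnClose` and the method `writeLines` are made up for illustration, and UTF-8 is an assumption; the code in the first post uses the platform default encoding.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class FlushOnClose {
    /** Writes every line to target, overwriting any existing content. */
    public static void writeLines(Path target, List<String> lines) throws IOException {
        // Closing the PrintWriter also flushes and closes the BufferedWriter beneath it.
        try (PrintWriter pw = new PrintWriter(
                Files.newBufferedWriter(target, StandardCharsets.UTF_8))) {
            for (String line : lines) {
                pw.println(line);
            }
        } // pw.close() runs here, even if the loop throws
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("flush-demo", ".txt");
        writeLines(tmp, Arrays.asList("first", "写入结果。", "last"));
        // Read the file back to confirm the final line reached the disk.
        System.out.println(Files.readAllLines(tmp, StandardCharsets.UTF_8));
        Files.delete(tmp);
    }
}
```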
    "It's not fixed until you stop calling the problem weird and you understand what was wrong." - gimbal2 2013

  5. #5
    dazz

    Default Re: Howto Detect Chinese characters in a line of text???

    Hi
    It has to get past the moderator before showing up.
    I can see that all of the input file is being processed to the end. The last few lines don't make it to the output.
    I flush BufferedWriter before closing but that doesn't seem to have an effect.
    The source code is in the new thread.

  6. #6
    Tolls is offline Moderator
    Join Date
    Apr 2009
    Posts
    13,541
    Rep Power
    26

    Default Re: Howto Detect Chinese characters in a line of text???

    I can't see any new thread.
    Please do not ask for code as refusal often offends.

    ** This space for rent **

  7. #7
    dazz

    Default Re: Howto Detect Chinese characters in a line of text???

    Hi
    I will resubmit it.
