  1. #1
    kosmos890 is offline Member
    Join Date
    Apr 2012
    Posts
    40
    Rep Power
    0

    How to check if a file is plain text or binary?

    I want to parse an HTML text file using Jsoup.
    How can I check whether a file is plain text or binary?

  2. #2
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,308
    Rep Power
    25

    Read the file char by char and check whether each char is a valid text character.
    If you don't understand my response, don't ignore it, ask a question.

  3. #3
    gimbal2 is offline Just a guy
    Join Date
    Jun 2013
    Location
    Netherlands
    Posts
    3,877
    Rep Power
    5

    I would check only the first X chars though, not the entire file. Generally you can tell that it is an HTML file if, after skipping any leading whitespace, it starts with <html or <!doctype in any form of capitalization.
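
A minimal sketch of that prefix check, under stated assumptions: the class name `HtmlSniffer`, the helper `readHead`, and the 512-byte limit are made up for illustration and are not part of Jsoup or of anything posted in this thread. It decodes the leading bytes as ISO-8859-1 only because the markers being searched for are plain ASCII.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class HtmlSniffer {
    // Assumed limit: how many leading bytes to inspect ("the first X chars").
    static final int SNIFF_LIMIT = 512;

    /** Reads at most limit bytes from the start of the file. */
    static byte[] readHead(Path file, int limit) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[limit];
            int total = 0;
            int n;
            // A single read() may return fewer bytes than requested, so loop.
            while (total < limit && (n = in.read(buf, total, limit - total)) != -1) {
                total += n;
            }
            return Arrays.copyOf(buf, total);
        }
    }

    /** Returns true if the leading bytes look like the start of an HTML document. */
    static boolean looksLikeHtml(byte[] data) {
        int len = Math.min(data.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to exactly one char, so this never fails.
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        int i = 0;
        // Skip a UTF-8 byte order mark (bytes EF BB BF) if present.
        if (head.startsWith("\u00EF\u00BB\u00BF")) {
            i = 3;
        }
        // Skip leading whitespace.
        while (i < head.length() && Character.isWhitespace(head.charAt(i))) {
            i++;
        }
        // Case-insensitive comparison against the two markers.
        return head.regionMatches(true, i, "<!doctype", 0, 9)
                || head.regionMatches(true, i, "<html", 0, 5);
    }
}
```

For a file on disk this would be called as `HtmlSniffer.looksLikeHtml(HtmlSniffer.readHead(path, HtmlSniffer.SNIFF_LIMIT))`. Pages that open with a comment or an XML declaration would be missed by this sketch.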
    "Syntactic sugar causes cancer of the semicolon." -- Alan Perlis

  4. #4
    AndrewM16921 is offline Senior Member
    Join Date
    Jan 2009
    Location
    NJ, USA
    Posts
    263
    Rep Power
    6

    Agree with that ^

    But my question is: if it's an HTML file, shouldn't it always be text? Why would a binary file have a .html extension?

  5. #5
    gimbal2 is offline Just a guy
    Join Date
    Jun 2013
    Location
    Netherlands
    Posts
    3,877
    Rep Power
    5

    The extension says nothing about the content of the file. It only provides an assumption, which in Windows environments will usually be correct because the operating system depends on it too, but in other environments it doesn't have to be. You can also get a file which does not have a .html extension but does contain HTML data, or something that is not even a file (for example, data fetched from the web). That's why you should always look at the data itself, and not the name of the resource, to make judgements about it.

  6. #6
    kosmos890 is offline Member
    Join Date
    Apr 2012
    Posts
    40
    Rep Power
    0

    Thanks for your replies.

    I have written this method to read the file char by char and check each one. It works for Latin symbols,
    but I can't detect some Greek symbols, e.g. with encoding windows-1253 (Greek) I get symbol 65533.
    I used the Greek symbol codes from here and here.
    Text file encodings:
    • utf8
    • iso-8859-1
    • iso-8859-7 (greek)
    • windows-1252
    • windows-1253 (greek)


    Java Code:
    import java.io.FileReader;
    import java.io.IOException;

    public boolean check(String filename) {

            boolean result = false;

            // try-with-resources closes the reader even if read() throws
            try (FileReader inputStream = new FileReader(filename)) {

                int c;
                while ((c = inputStream.read()) != -1) {

                    // (10) line feed, (11) vertical tab, (13) carriage return, (32) space to (126) tilde
                    if (c == 10 || c == 11 || c == 13 || (c >= 32 && c <= 126)) {
                        result = true;

                    // (153) and (160) to (255): codes above the ASCII range
                    } else if (c == 153 || (c >= 160 && c <= 255)) {
                        result = true;

                    // Greek: (884)ʹ (885)͵ (890)ͺ (894); and (900)΄ to (974)ώ
                    } else if (c == 884 || c == 885 || c == 890 || c == 894 || (c >= 900 && c <= 974)) {
                        result = true;

                    } else {
                        // not in any accepted range: print the code and stop
                        System.out.println(c + " " + (char) c);
                        result = false;
                        break;
                    }
                }
            } catch (IOException ex) {
                // also covers FileNotFoundException; the result so far is returned
            }
            return result;
        }
    ---------- EDIT ----------

    I prefer to use numeric codes instead of symbols because I don't know how to insert Greek symbols (ά,β,γ,δ,έ,ζ,ή,θ....) into my code.
    Last edited by kosmos890; 09-28-2013 at 06:32 PM.
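
A note on the value 65533: that is U+FFFD, the Unicode replacement character. `FileReader(String)` decodes with the JVM's default charset and silently substitutes U+FFFD for bytes it cannot decode, so seeing 65533 usually means the file was read with the wrong encoding rather than that it is binary. The sketch below shows one way to make decoding failures visible instead; the class and method names (`StrictDecode`, `decodeOrNull`, `decodeFirstMatch`) are made up for illustration, and only standard `java.nio.charset` calls are used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

final class StrictDecode {

    /** Returns the decoded text, or null if the bytes are not valid in the given charset. */
    static String decodeOrNull(byte[] data, String charsetName) {
        CharsetDecoder decoder = Charset.forName(charsetName)
                .newDecoder()
                // REPORT makes the decoder throw instead of inserting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /** Tries the charsets in the given order and returns the first successful decoding, or null. */
    static String decodeFirstMatch(byte[] data, String... charsetNames) {
        for (String name : charsetNames) {
            String text = decodeOrNull(data, name);
            if (text != null) {
                return text;
            }
        }
        return null;
    }
}
```

The order matters: UTF-8 is strict and rejects most non-UTF-8 input, while single-byte charsets such as windows-1253 or iso-8859-7 accept nearly any byte sequence, so they are better tried last, for example `decodeFirstMatch(bytes, "UTF-8", "windows-1253")`. Whether windows-1253 is available depends on the charsets shipped with the Java runtime.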

  7. #7
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,308
    Rep Power
    25

    Some parts of the code would be easier to understand if char values were used instead of int values.
    Not everyone knows which char is 10, 13, 32, 126, etc.
    Last edited by Norm; 09-28-2013 at 05:09 PM.
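
As an illustration of that suggestion, here is a small sketch with the same kind of range tests written as char literals. The class and method names are hypothetical. Greek characters are written as Unicode escapes, which is one way to put them into source code without typing the symbols themselves; the bounds 0384 and 03CE are simply the hexadecimal forms of 900 and 974 from the posted code.

```java
final class TextChars {

    /** Tab, line feed, carriage return, or a printable ASCII character (space to tilde). */
    static boolean isPlainAscii(int c) {
        return c == '\t' || c == '\n' || c == '\r' || (c >= ' ' && c <= '~');
    }

    /** Same range as 900..974 in the posted code: Greek tonos up to small omega with tonos. */
    static boolean inGreekRange(int c) {
        return c >= '\u0384' && c <= '\u03CE';
    }
}
```

A char literal is promoted to int in these comparisons, so the methods still accept the int returned by `Reader.read()`.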

  8. #8
    kosmos890 is offline Member
    Join Date
    Apr 2012
    Posts
    40
    Rep Power
    0

    I have problems with Greek chars.
    As a solution I use Character.UnicodeBlock to detect Latin and Greek chars.
    How can I improve my code?

    Java Code:
    public boolean check(String filename) {
    
            boolean result = false;
    
            FileReader inputStream = null;
    
            try {
                inputStream = new FileReader(filename);
    
                int c;
                while ((c = inputStream.read()) != -1) {
    
                    Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
                    
                    if (block == Character.UnicodeBlock.BASIC_LATIN || block == Character.UnicodeBlock.GREEK) {
    //                     (9)Horizontal Tab (10)Line feed  (11)Vertical tab (13)Carriage return (32)Space (126)tilde
                        if (c==9 || c == 10 || c == 11 || c == 13 || (c >= 32 && c <= 126)) {
                            result = true;
    
    //                            (153)Superscript two (160)ϊ  (255) No break space                     
                        } else if (c == 153 || c >= 160 && c <= 255) {
                            result = true;
    
    //                            (884)ʹ (885)͵ (890)ͺ (894); (900)' (974)ώ     
                        } else if (c == 884 || c == 885 || c == 890 || c == 894 || c >= 900 && c <= 1019) {
                            result = true;
    
                        } else {
                            result = false;
                            break;
                        }
                    }
                }
            } catch (FileNotFoundException ex) {
            } catch (IOException ex) {
            } finally {
    
                if (inputStream != null) {
                    try {
                        inputStream.close();
                    } catch (IOException ex) {
                    }
                }
            }
            return result;
        }
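
Some observations on the method above, followed by one possible simplification. A character outside the two blocks skips the outer `if` entirely, so it never sets `result` to false. Codes 153 and 160 to 255 belong to the LATIN_1_SUPPLEMENT block, so that branch cannot run inside the block test. And 65533 (U+FFFD) sits in the SPECIALS block. The sketch below is only an assumption about what an improved version could look like: the class name `BlockCheck` and the choice of allowed blocks are illustrative, not a definitive list.

```java
import java.io.IOException;
import java.io.Reader;
import java.lang.Character.UnicodeBlock;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class BlockCheck {
    // Assumed allow-list of blocks; extend it for other scripts.
    private static final Set<UnicodeBlock> ALLOWED = new HashSet<>(Arrays.asList(
            UnicodeBlock.BASIC_LATIN,
            UnicodeBlock.LATIN_1_SUPPLEMENT,
            UnicodeBlock.GREEK,
            UnicodeBlock.GENERAL_PUNCTUATION));

    /** Decides for one decoded char whether it counts as text. */
    static boolean isTextChar(int c) {
        if (c == '\t' || c == '\n' || c == '\r') {
            return true;                     // common whitespace controls
        }
        if (Character.isISOControl(c)) {
            return false;                    // remaining C0 and C1 control codes
        }
        // Anything outside the allow-list, including U+FFFD, is rejected.
        return ALLOWED.contains(UnicodeBlock.of(c));
    }

    /** Returns false at the first non-text char; an empty input counts as text. */
    static boolean looksLikeText(Reader in) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (!isTextChar(c)) {
                return false;
            }
        }
        return true;
    }
}
```

The reader should be created with an explicit charset, for example `new InputStreamReader(new FileInputStream(filename), "windows-1253")`, so that the result does not depend on the platform default.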

  9. #9
    jim829 is offline Senior Member
    Join Date
    Jan 2013
    Location
    Northern Virginia, United States
    Posts
    3,512
    Rep Power
    5

    Actually, IMHO, this is not an exact science. I would focus on structure and statistics instead of checking individual characters. What if you come across some text-based programming language that has a string of special characters? With a little research you could find some papers on the overall layout of text-based files versus binary ones. I would imagine that text-based files typically have whitespace spread throughout. For special formats like base64 and uuencode, which are void of whitespace, you can check the character set or the first and last delimiters.

    Regards,
    Jim
    The Java™ Tutorial | SSCCE | Java Naming Conventions
    Poor planning on your part does not constitute an emergency on my part.
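
One rough way to apply the statistical idea is sketched below. Everything in it is an assumption made for illustration: the class name `BinaryHeuristic`, the 8192-byte sample size, and the 5% threshold are not taken from any paper or library, and the sketch counts control bytes rather than measuring how whitespace is distributed.

```java
final class BinaryHeuristic {
    static final int SAMPLE_SIZE = 8192;           // assumed: inspect only the first 8 KiB
    static final double MAX_CONTROL_RATIO = 0.05;  // assumed threshold, tune as needed

    /** Guesses whether the leading bytes belong to a binary file. */
    static boolean looksBinary(byte[] data) {
        int len = Math.min(data.length, SAMPLE_SIZE);
        if (len == 0) {
            return false;                          // nothing to judge, treat as text
        }
        int control = 0;
        for (int i = 0; i < len; i++) {
            int b = data[i] & 0xFF;                // view the byte as unsigned
            if (b == 0) {
                return true;                       // NUL is rare in single-byte and UTF-8 text
            }
            boolean whitespace = b == '\t' || b == '\n' || b == '\r' || b == 0x0C;
            if ((b < 0x20 && !whitespace) || b == 0x7F) {
                control++;                         // other control bytes count against "text"
            }
        }
        return (double) control / len > MAX_CONTROL_RATIO;
    }
}
```

UTF-16 text contains many NUL bytes and would be misclassified by this sketch, so such files would need a byte order mark check or a decoding attempt first.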
