  1. #1
    gezzel is offline Member
    Join Date
    Sep 2008
    Posts
    16
    Rep Power
    0

    Java heap space error

    I'm trying to read in a fairly large CSV file (~160 MB).
    Java Code:
    import java.util.regex.Pattern;
    import java.util.regex.Matcher;
    import java.io.*;
    
    /**
     * @author Lawrence
     */
    public class RegexFormat {   
        
        public static void main(String args[]) {
        
            String s1 = " & ";         //To " and "
            String s2 = "&";           //To " and "
            String s3 = "Tampa";       //To Boston
            String s4 = "Florida";     //To Massachusetts
            String s5 = "Carolinas";   //To New England
            String s6 = "Richmond";    //To Providence
            String s7 = "Southern";    //To Northern
            String s8 = "Energy";      //To Power
            
            String f1 = " and ";
            String f2 = " and ";
            String f3 = "Boston";
            String f4 = "Massachusetts";
            String f5 = "New England";
            String f6 = "Providence";
            String f7 = "Northern";
            String f8 = "Power";
            
            //Array of Replacements
            String[] ListToFind = {s1, s2, s3, s4, s5, s6, s7, s8};
            String[] ListToReplace = {f1, f2, f3, f4, f5, f6, f7, f8};
            
    
            //Patterns 
            Pattern p1 = Pattern.compile(s1, Pattern.CASE_INSENSITIVE);        //To " and "
            Pattern p2 = Pattern.compile(s2, Pattern.CASE_INSENSITIVE);          //To " and "
            Pattern p3 = Pattern.compile(s3, Pattern.CASE_INSENSITIVE);      //To Boston
            Pattern p4 = Pattern.compile(s4, Pattern.CASE_INSENSITIVE);    //To Massachusetts
            Pattern p5 = Pattern.compile(s5, Pattern.CASE_INSENSITIVE);  //To New England
            Pattern p6 = Pattern.compile(s6, Pattern.CASE_INSENSITIVE);   //To Providence
            Pattern p7 = Pattern.compile(s7, Pattern.CASE_INSENSITIVE);   //To Northern
            Pattern p8 = Pattern.compile(s8, Pattern.CASE_INSENSITIVE);     //To Power
        
        Pattern[] toFindArray = {p1, p2, p3, p4, p5, p6, p7, p8};
        String testInput = "I am going to tampa, florida which is near Carolinas & Richmond. Paul&I will be going to the Southern part to research Energy";
        
        //Init first matcher
        //Matcher toMatch = toFindArray[0].matcher(testInput);
        String strInput = "x";
        try{
            byte[] FileInput = ReadFile(args[0]);
            // make a backup copy			
            WriteFile(args[0]+".backup.copy",FileInput);
            strInput = new String(FileInput);
                //loop to clean data
            for (int i = 0; i < 8; i++) {
                Matcher toMatch;
                toMatch = toFindArray[i].matcher(strInput);
                strInput = toMatch.replaceAll(ListToReplace[i]);
            }
            WriteFile(args[0],strInput.getBytes());
        }
        catch(Exception e){			
            System.out.println(e.getMessage());		
        }
        
        
            Console console = System.console();
            System.out.println(strInput);    //System.out.println("The element at [0][2] is " + replaceArray[4][1]);
        }
        
        
        //To Read a File into a Byte Array
        static public final byte[] ReadFile(String strFile) throws IOException {
            int nSize = 32768;
            // open the input file stream
            BufferedInputStream inStream = new BufferedInputStream(new FileInputStream(strFile), nSize);
            byte[] pBuffer = new byte[nSize];
            int nPos = 0;
            // read bytes into a buffer
            nPos += inStream.read(pBuffer, nPos, nSize - nPos);
            // while the buffer is filled, double the buffer size and read more
            while (nPos == nSize) {
                byte[] pTemp = pBuffer;
                nSize *= 2;
                pBuffer = new byte[nSize];
                System.arraycopy(pTemp, 0, pBuffer, 0, nPos);
                nPos += inStream.read(pBuffer, nPos, nSize - nPos);
            }
            // close the input stream
            inStream.close();
            if (nPos == 0) {
                return "".getBytes();
            }
            // return data read into the buffer as a byte array
            byte[] pData = new byte[nPos];
            System.arraycopy(pBuffer, 0, pData, 0, nPos);
            return pData;
        }
    
        //To Write to File
        static public final void WriteFile(String strFile, byte[] pData) throws IOException {
            BufferedOutputStream outStream = new BufferedOutputStream(new FileOutputStream(strFile), 32768);
            if (pData.length > 0) {
                outStream.write(pData, 0, pData.length);
            }
            outStream.close();
        }
            
                
    }

    This is the error I get when I try to run 'java RegexFormat myfile.csv':
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at RegexFormat.ReadFile(regexformat.java:105)
    at RegexFormat.main(regexformat.java:70)

    I tried to run with -Xmx512m and got this error:
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:133)
    at java.lang.StringCoding.decode(StringCoding.java:173)
    at java.lang.StringCoding.decode(StringCoding.java:185)
    at java.lang.String.<init>(String.java:571)
    at java.lang.String.<init>(String.java:594)
    at RegexFormat.main(regexformat.java:73)

  2. #2
    gezzel is offline Member

    This actually compiles/runs under java version "1.4.2_05"
    but not java version "1.6.0_10-rc2"

  3. #3
    Norm is online now Moderator
    Join Date
    Jun 2008
    Location
    Eastern Florida
    Posts
    16,608
    Rep Power
    23

    Default

    Do you really need to read all of such a large file into memory?

  4. #4
    gezzel is offline Member

    Yes, my task is to take in large .csv files for formatting. I'm not sure how else I could get around it.

  5. #5
    Norm is online now Moderator

    It could depend on the data. Can it be broken up into sub parts?
    What are the formatting rules?

  6. #6
    gezzel is offline Member

    I don't believe the file can be broken up. The purpose of the program is to automate the formatting: the user just selects a file and the program formats it by replacing 8 strings (e.g. " & " -> " and ", "Southern" -> "Northern"; nothing that complex).

  7. #7
    Norm is online now Moderator

    Are you saying that:
    1) there is only one record in the file, and you have to read the whole record before you can process it?
    2) you don't have all of the editing instructions/formatting rules until all of the file has been read?

  8. #8
    gezzel is offline Member

    I guess I was unsure what you meant by broken up. If the program can parse and process the file line by line, it would work. I have all the formatting rules already; it's about 8 find-and-replace strings. I'm not sure how I would process and build a new string and write that to a file doing it line by line...

  9. #9
    Norm is online now Moderator

    The algorithm could be as simple as:
    loop through all lines in the file:
        read a line from file1
        edit the line as per the rules
        write the new line to file2
    end loop
    close both files
    rename/delete input file1
    rename file2 to ...
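
    The loop outlined above could be sketched in Java roughly as follows. This is a minimal illustration, not code from the thread: the class name LineByLineReplace, the ".tmp" suffix for file2, the charset parameter and the use of java.nio.file (which needs Java 7 or later) are all assumptions, and only two of the eight rules are shown for brevity.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.regex.Pattern;

// Hypothetical sketch of the read/edit/write loop; names are illustrative only.
final class LineByLineReplace {

    // Two of the eight rules from the first post.
    static final Pattern[] FIND = {
        Pattern.compile("Tampa", Pattern.CASE_INSENSITIVE),
        Pattern.compile("Florida", Pattern.CASE_INSENSITIVE)
    };
    static final String[] REPLACE = {"Boston", "Massachusetts"};

    // Apply every rule, in order, to a single line.
    static String editLine(String line) {
        for (int i = 0; i < FIND.length; i++) {
            line = FIND[i].matcher(line).replaceAll(REPLACE[i]);
        }
        return line;
    }

    // Read file1 line by line, write each edited line to file2,
    // then move file2 over file1. Only one line is held in memory at a time.
    static void process(Path file1, Charset charset) throws IOException {
        Path file2 = file1.resolveSibling(file1.getFileName() + ".tmp"); // assumed temp name
        try (BufferedReader in = Files.newBufferedReader(file1, charset);
             BufferedWriter out = Files.newBufferedWriter(file2, charset)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(editLine(line));
                out.newLine(); // readLine() strips the terminator, so add one back
            }
        } // both files are closed here, even if an exception is thrown
        Files.move(file2, file1, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // The charset is an assumption; pass whatever the input file really uses.
        process(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
    }
}
```

    Note that this sketch overwrites the input without keeping a backup copy, and that newLine() writes the platform line separator, so line endings may differ from the input's.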

  10. #10
    Eranga is offline Moderator
    Join Date
    Jul 2007
    Location
    Colombo, Sri Lanka
    Posts
    11,372
    Blog Entries
    1
    Rep Power
    19

    Default

    Quote Originally Posted by gezzel View Post
    I guess I was unsure what you meant by broken up. If the program can parse and process the file line by line, it would work. I have all the formatting rules already; it's about 8 find-and-replace strings. I'm not sure how I would process and build a new string and write that to a file doing it line by line...
    "Broken up" means working on part of the data at a time, a small portion at a time. When dealing with files in Java, the usual way is to read line by line. Norm has explained what the algorithm looks like.

  11. #11
    jason wang is offline Member
    Join Date
    Sep 2008
    Posts
    16
    Rep Power
    0

    Default

    I think you can read it line by line and store it in a database. Then you can analyze it through the database API.

  12. #12
    gezzel is offline Member

    Are you saying I should append each processed line directly to a file, or build the output string line by line and then write that whole string to a file?

    So far I have this...
    Java Code:
    public void ReadReplace(String fileLocation) {
            try {
                File FileInput = new File(fileLocation);
                FileReader reader = new FileReader(FileInput);
                BufferedReader in = new BufferedReader(reader);
                String oneLine;
                String buildMe = "";
                while ((oneLine = in.readLine()) != null) {
                    
                    //To Do Find/Replace on Single Line
                    for (int i = 0; i < 8; i++) {
                        Matcher toMatch;
                        toMatch = toFindArray[i].matcher(oneLine);
                        oneLine = toMatch.replaceAll(ListToReplace[i]);
                    }
                    //Append to buildMe String
                    buildMe = buildMe + oneLine;
                }
                in.close();
                
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    I think I'm supposed to append to a file, but I don't know how.

  13. #13
    gezzel is offline Member

    I just ran into another problem processing line by line: it's VERY slow. I have the program temporarily appending to a string (buildMe).
    The program has run for over 10 minutes now and I don't believe it's done processing... I don't know if this will speed up when I change the code to write directly to a file instead of appending to a temporary string.

    Here's the code I have now. Is there any way to optimize the speed?
    Java Code:
    public class RegexFormat2 {
    
        static String s1 = " & ";         //To " and "
        static String s2 = "&";           //To " and "
        static String s3 = "Tampa";       //To Boston
        static String s4 = "Florida";     //To Massachusetts
        static String s5 = "Carolinas";   //To New England
        static String s6 = "Richmond";    //To Providence
        static String s7 = "Southern";    //To Northern
        static String s8 = "Energy";      //To Power
        
        static String f1 = " and ";
        static String f2 = " and ";
        static String f3 = "Boston";
        static String f4 = "Massachusetts";
        static String f5 = "New England";
        static String f6 = "Providence";
        static String f7 = "Northern";
        static String f8 = "Power";
    
        //Array of Replacements
        static String[] ListToFind = {s1, s2, s3, s4, s5, s6, s7, s8};
        static String[] ListToReplace = {f1, f2, f3, f4, f5, f6, f7, f8};
        //Patterns
        static Pattern p1 = Pattern.compile(s1, Pattern.CASE_INSENSITIVE);        //To " and "
        static Pattern p2 = Pattern.compile(s2, Pattern.CASE_INSENSITIVE);          //To " and "
        static Pattern p3 = Pattern.compile(s3, Pattern.CASE_INSENSITIVE);      //To Boston
        static Pattern p4 = Pattern.compile(s4, Pattern.CASE_INSENSITIVE);    //To Massachusetts
        static Pattern p5 = Pattern.compile(s5, Pattern.CASE_INSENSITIVE);  //To New England
        static Pattern p6 = Pattern.compile(s6, Pattern.CASE_INSENSITIVE);   //To Providence
        static Pattern p7 = Pattern.compile(s7, Pattern.CASE_INSENSITIVE);   //To Northern
        static Pattern p8 = Pattern.compile(s8, Pattern.CASE_INSENSITIVE);     //To Power
        static Pattern[] toFindArray = {p1, p2, p3, p4, p5, p6, p7, p8};
    
        public static void main(String args[]) {     
            System.out.println("Running on: " + args[0]);
            ReadReplace(args[0]);
        }

        public static void ReadReplace(String fileLocation) {
            try {
                File FileInput = new File(fileLocation);
                FileReader reader = new FileReader(FileInput);
                BufferedReader in = new BufferedReader(reader);
                String oneLine;
                String buildMe = "";
                while ((oneLine = in.readLine()) != null) {
    
                    //To Do Find/Replace on Single Line
                    for (int i = 0; i < 8; i++) {
                        Matcher toMatch;
    
                        //init Pattern to find
                        toMatch = toFindArray[i].matcher(oneLine);
                        //System.out.println("Current oneLine:" + oneLine);
                        oneLine = toMatch.replaceAll(ListToReplace[i]);
                        //System.out.println("Current Temp:" + temp);
                    }
                    //Append to buildMe String
                    buildMe = buildMe + "\n";
                    buildMe = buildMe + oneLine;
                }
                in.close();
                System.out.println(buildMe);
    
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

  14. #14
    Norm is online now Moderator

    "processing line by line, it's VERY slow."
    "buildMe = buildMe + oneLine;"
    That concatenation will be slow: each pass copies the whole string built so far, so the cost grows with every line. It would be faster if you used a StringBuffer to build the string. But there is no reason to keep the processed string in memory. Write it to a file.
    Is there a reason you decided NOT to write the converted line directly to a file?
    "will speed up when i change code to write directly to a file"
    Yes, it will be a lot faster. Just a little slower than the time to copy the file.
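
    To illustrate the concatenation point above, here is a small sketch comparing the two ways of accumulating lines. It uses StringBuilder, the unsynchronized counterpart of StringBuffer available since Java 5; the class and method names are made up for this example. For a 160 MB file the better fix is still to write each line straight to the output file rather than accumulate anything.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo; not code from the thread.
final class BuildStringDemo {

    // Pattern from the thread: every iteration allocates a new String
    // and copies all of the text accumulated so far into it.
    static String joinWithConcat(List<String> lines) {
        String buildMe = "";
        for (String line : lines) {
            buildMe = buildMe + line + "\n";
        }
        return buildMe;
    }

    // Same result, but appends go into one growable buffer.
    static String joinWithBuilder(List<String> lines) {
        StringBuilder buildMe = new StringBuilder();
        for (String line : lines) {
            buildMe.append(line).append('\n');
        }
        return buildMe.toString();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a,b", "c,d");
        // Both approaches produce identical text; only the cost differs.
        System.out.println(joinWithConcat(lines).equals(joinWithBuilder(lines)));
    }
}
```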

  15. #15
    gezzel is offline Member

    This is the code I have now, but there are 2 main problems.

    1. It still runs very slowly; it takes at least 15 minutes to process this file. Any tips on optimizing the speed? The for loop runs 8 different patterns against each line.

    2. The output file is much larger than expected. I manually formatted a CSV file with Notepad and the result differed in size by about 3 MB, but the file produced by this code is nearly double the initial size.

    Java Code:
        public static void ReadReplace(String fileLocation) {
            try {
                FileOutputStream fos;
                DataOutputStream dos;
            
                File FileOutput = new File("cleaned_"+ fileLocation );
                fos = new FileOutputStream(FileOutput);
                dos = new DataOutputStream(fos);
                File FileInput = new File(fileLocation);
                FileReader reader = new FileReader(FileInput);
                BufferedReader in = new BufferedReader(reader);
                String oneLine;
                //String buildMe = "";
                while ((oneLine = in.readLine()) != null) {
    
                    //To Do Find/Replace on Single Line
                    for (int i = 0; i < 8; i++) {
                        Matcher toMatch;
    
                        //init Pattern to find
                        toMatch = toFindArray[i].matcher(oneLine);
                        //System.out.println("Current oneLine:" + oneLine);
                        oneLine = toMatch.replaceAll(ListToReplace[i]);
                        //System.out.println("Current Temp:" + temp);
                    }
                    //Write the processed line to the output file
                    dos.writeChars(oneLine);
                    dos.writeChars("\n");
                }
                in.close();
                fos.close();
                dos.close();
    
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

  16. #16
    jurka is offline Member
    Join Date
    Jul 2008
    Posts
    67
    Rep Power
    0

    Default

    What about using the NIO library for this?

  17. #17
    Norm is online now Moderator

    "resulting file after running this code is nearly double the initial size."
    Can you look at the output file and see what is in it?
    Why do you use a DataOutputStream & writeChars vs PrintStream & println?

    Write a test program to see where the time is lost.
    Have a smaller input file. Use System.currentTimeMillis() to capture the times for the loop. Compare the times of a version of the program with the replacements against the times for a version without the replacements (i.e. just a file copy).
    Also compare with another version that just reads the input file.
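
    A timing test along those lines might look like the sketch below. It is only an illustration: the class name, the generated in-memory sample data (used here instead of a small input file so the example is self-contained) and the single "Tampa" rule are assumptions. The measured times will vary by machine and JVM, so none are shown.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;

// Hypothetical timing harness; not code from the thread.
final class TimingSketch {

    static final Pattern TAMPA = Pattern.compile("Tampa", Pattern.CASE_INSENSITIVE);

    // Reads every line and optionally applies one replacement.
    // Returns the number of characters seen so the work has a visible result.
    static long run(BufferedReader in, boolean withReplace) throws IOException {
        long chars = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (withReplace) {
                line = TAMPA.matcher(line).replaceAll("Boston");
            }
            chars += line.length();
        }
        return chars;
    }

    // Builds made-up CSV-like rows to stand in for a smaller input file.
    static String sampleData(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("row ").append(i).append(",Tampa,Florida\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String data = sampleData(100000);
        for (boolean withReplace : new boolean[] {false, true}) {
            long start = System.currentTimeMillis();
            long chars = run(new BufferedReader(new StringReader(data)), withReplace);
            long elapsed = System.currentTimeMillis() - start;
            System.out.println((withReplace ? "with" : "without")
                    + " replacements: " + elapsed + " ms, " + chars + " chars");
        }
    }
}
```

    The first pass also pays for JVM warm-up, so repeating each variant a few times gives a fairer comparison.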

    Create an array of Matchers outside the loop and use it vs calling the matcher() method every time inside the loop.
    Last edited by Norm; 09-24-2008 at 04:25 PM.

  18. #18
    gezzel is offline Member

    I just found out that my original file was encoded in Western (ISO-8859-1) but the resulting file is encoded in UTF-16... Is there a way I can force it to encode back in ISO-8859-1? I tried 'javac -encoding utf8' to see if it would encode in UTF-8, but it still produced the file as UTF-16.

    Jurka: what's the NIO library?
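
    On the encoding question above: DataOutputStream.writeChars writes every char as two bytes, high byte first, which is why the output roughly doubles in size and is detected as UTF-16. The javac -encoding option only tells the compiler how the .java source files are encoded; it has no effect on what the program writes at run time. One way to control the file encoding is to name the charset when wrapping the streams. The sketch below assumes the input really is ISO-8859-1; the class name is made up, and the "cleaned_" output name follows the earlier post.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

// Hypothetical sketch; not code from the thread.
final class Latin1Copy {

    // Assumed encoding of both the input and the desired output.
    static final String CHARSET = "ISO-8859-1";

    // Copies input to output line by line, decoding and encoding as ISO-8859-1.
    static void copy(File input, File output) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new FileInputStream(input), CHARSET));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(output), CHARSET))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // The find/replace step from the thread would go here.
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        copy(new File(args[0]), new File("cleaned_" + args[0]));
    }
}
```

    With ISO-8859-1 each character is written as a single byte, so the output should stay close to the input size apart from the replacements and any change in line endings.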

  19. #19
    gezzel is offline Member

    Wouldn't I have to call matcher() inside the loop, since it processes one line at a time and matcher() has to be called with the line it's working on?

    I changed DataOutputStream and writeChars to PrintStream and println. It already seems to be faster (I can see the file size increase as the program runs...).

  20. #20
    Norm is online now Moderator

    Yes you are right. It needs to be inside the loop.
    As an exercise I wrote some code to time how long regex took vs using indexOf and substring. The older methods were from 2 to 3 times faster. That was being case sensitive. With case insensitive it was a little less than twice as fast.
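
    The code used for that comparison was not posted. As a rough idea of what a regex-free, case-sensitive literal replacement built on indexOf can look like, here is a hypothetical sketch; the class and method names are invented, and it appends ranges to a StringBuilder rather than calling substring. It makes no claim about reproducing the timings quoted above.

```java
// Hypothetical sketch of a literal (non-regex), case-sensitive replace-all.
final class IndexOfReplace {

    static String replaceAll(String text, String find, String replacement) {
        if (find.isEmpty()) {
            return text; // nothing sensible to search for
        }
        int from = 0;
        int hit = text.indexOf(find, from);
        if (hit < 0) {
            return text; // no match: return the original without copying
        }
        StringBuilder out = new StringBuilder(text.length());
        while (hit >= 0) {
            out.append(text, from, hit);   // text between the previous match and this one
            out.append(replacement);
            from = hit + find.length();    // continue searching after the match
            hit = text.indexOf(find, from);
        }
        out.append(text, from, text.length()); // remainder after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAll("Paul&I, Carolinas & Richmond", " & ", " and "));
    }
}
```

    For plain literal replacement the standard String.replace(CharSequence, CharSequence) method does the same job; the hand-written version is shown only to make the indexOf approach concrete.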
