  1. #1
    ron

    Default Splitting a file of 5GBs into manageable chunks of 5MB

    Hi all,

    I am new to Java. I have a small project involving a file of 10 million records of data located at d:/test/orders.csv

    I would like to split this file into manageable files of 5MB each, keeping the same base name, and store them in a separate folder d:/test/output, e.g.
    orders1.csv
    orders2.csv
    orders3.csv
    ...
    ordersn.csv

    Thanks,

    Ron

  2. #2
    SurfMan

    You can subclass a FileOutputStream, keep track of how many bytes you have written, then at a set threshold, start a new file.

    Edit: After writing this, I started creating an example. I am not sure it's as simple as I said it would be... :) I'll get back with an example.
    Last edited by SurfMan; 02-21-2014 at 12:46 AM.
    "It's not fixed until you stop calling the problem weird and you understand what was wrong." - gimbal2 2013

  3. #3
    SurfMan

    Apparently it *was* that easy. I whipped up an example you can build on. Others can probably do better, but I was bored anyway, so there you go.

    It's a simple OutputStream that keeps track of the size and delegates to FileOutputStreams creating new files as it goes. You should use it as the receiving end of a copy operation like you do with InputStreams and OutputStreams.

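    Limitations: the buffer size should always be smaller than the size of the destination files. The size check runs only before each write, so a file can overshoot the limit by up to one buffer.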

    Note: flush() and close() are overridden as well, so that each file, including the last one, is closed properly.

    Java Code:
    import java.io.*;

    public class MaxFileOutputStream extends OutputStream {

        private final String destinationFilename;
        private final File destinationDirectory;

        private final int maxSize;   // size in bytes at which a new file is started
        private int bytesWritten;    // bytes written to the current file

        private int counter;         // number of the current file, starting at 1
        private OutputStream destinationOutputStream;

        public MaxFileOutputStream(File destinationDirectory, String destinationFilename, int maxSize) {
            this.destinationFilename = destinationFilename;
            this.destinationDirectory = destinationDirectory;
            this.maxSize = maxSize;
        }

        @Override
        public void write(byte[] b) throws IOException {
            write(b, 0, b.length);
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            checkDestinationFile();
            bytesWritten += len;
            destinationOutputStream.write(b, off, len);
        }

        @Override
        public void write(int b) throws IOException {
            checkDestinationFile();
            bytesWritten++;
            destinationOutputStream.write(b);
        }

        @Override
        public void flush() throws IOException {
            if (destinationOutputStream != null) {
                destinationOutputStream.flush();
            }
        }

        @Override
        public void close() throws IOException {
            if (destinationOutputStream != null) {
                destinationOutputStream.close();
                destinationOutputStream = null;
            }
        }

        // Opens the first file, and a new one whenever the current file has reached maxSize.
        private void checkDestinationFile() throws IOException {
            if (destinationOutputStream == null || bytesWritten >= maxSize) {
                close(); // close the previous file, if any, so no file handles leak
                counter++;
                bytesWritten = 0;
                File destinationFile = new File(destinationDirectory, destinationFilename + "-" + counter);
                destinationOutputStream = new FileOutputStream(destinationFile);
            }
        }

        public static void main(String[] args) {
            // try-with-resources closes both streams, even if the copy fails halfway
            try (FileInputStream fis = new FileInputStream(new File("some source file.txt"));
                 MaxFileOutputStream test = new MaxFileOutputStream(new File("/home/user/mydir"), "some destination file", 5 * 1024 * 1024)) {

                byte[] buffer = new byte[1024];
                int len;
                while ((len = fis.read(buffer)) > 0) {
                    test.write(buffer, 0, len);
                }
                test.flush();
            }
            catch (IOException e) {
                e.printStackTrace();
            }
        }

    }

  4. #4
    Norm

    I assume that your CSV file contains records of various lengths, each ending with a newline character. This code works at the byte level, not the record level, so a record can be cut in half at a file boundary. If you want the new CSV files to contain whole records, this code will require some adjustments.
    If you don't understand my response, don't ignore it, ask a question.
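
One way to make the adjustment Norm describes is to copy line by line and only switch files between lines. The following is a minimal sketch, not a definitive implementation. The class name RecordSplitter and its split method are made up for this example. It assumes every record sits on a single line (no quoted fields with embedded newlines), and it reads the data as ISO-8859-1 so that each byte maps to exactly one character and the byte count is easy to track.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: splits a text file into numbered files without cutting a line in half.
public class RecordSplitter {

    /**
     * Copies source into baseName1.csv, baseName2.csv, ... inside outputDir.
     * A new file is started before a line once the current file holds at least maxBytes.
     * Returns the number of files written.
     */
    public static int split(Path source, Path outputDir, String baseName, long maxBytes) throws IOException {
        Files.createDirectories(outputDir);
        int fileCount = 0;
        long bytesInCurrent = 0;
        BufferedWriter out = null;
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (out == null || bytesInCurrent >= maxBytes) {
                    if (out != null) {
                        out.close();
                    }
                    fileCount++;
                    bytesInCurrent = 0;
                    Path target = outputDir.resolve(baseName + fileCount + ".csv");
                    out = Files.newBufferedWriter(target, StandardCharsets.ISO_8859_1);
                }
                out.write(line);
                out.write('\n'); // readLine() strips the terminator, so CRLF becomes LF here
                bytesInCurrent += line.length() + 1; // ISO-8859-1: one byte per character
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return fileCount;
    }

    public static void main(String[] args) throws IOException {
        // Paths taken from the first post; 5MB per file
        int files = split(Paths.get("d:/test/orders.csv"), Paths.get("d:/test/output"), "orders", 5L * 1024 * 1024);
        System.out.println("Wrote " + files + " files");
    }
}
```

As with the byte-level version, a file can end up slightly larger than the limit, here by at most one record.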

  5. #5
    ron

    Norm, this is what I am trying to convert from ActivePerl into Java:


    #!/usr/bin/perl -w

    $file=$ARGV[0];

    if(!$file) {
    print "$0 <file> \n";
    exit(1);
    }


    open($fh, "< $file") or die "Unable to open $file";
    $headers=<$fh>; # Read

    $rc=0;
    $file_cnt=1;


    $file_out = $file . "_" . $file_cnt;

    while($line=<$fh>) {
    open($fho, "> $file_out") or die "Unable to open $file_out";


    while ($rc < 10000000) {
    $line=~s/\015//g;
    $line=~s/[^[:ascii:]]//g;
    $rc++;
    print $fho $line;
    $line=<$fh>;

    if (eof($fh))
    {
    close ($fh);
    close ($fho);
    exit(0);
    }
    }
    close ($fho);
    $rc = 0;


    $file_cnt= $file_cnt + 1;
    $file_out = $file . "_" . $file_cnt;
    }

    close ($fh);
    close ($fho);

    exit(0);

    The script in this example takes a file as an argument, opens an output file with the same name plus an incrementing number, and splits the file while removing extra carriage returns (015) as well as non-ASCII characters. This script is written in ActivePerl.

    Thanks,

    Ron
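
For comparison, here is a rough Java sketch of what the script is described as doing: skip the header line, write a fixed number of records per output file, and strip carriage returns and non-ASCII characters along the way. The class name LineCountSplitter and the helpers readRecord and clean are invented for this example. It is written to keep every data record and does not try to reproduce the Perl's exact behaviour at chunk boundaries or at end of file. Reading as ISO-8859-1 is an assumption that keeps the processing byte-oriented, like the Perl.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical port of the Perl script: split by record count and sanitize each record.
public class LineCountSplitter {

    /** Removes every character outside 7-bit ASCII, then every carriage return (octal 015). */
    static String clean(String line) {
        return line.replaceAll("[^\\x00-\\x7F]", "").replace("\r", "");
    }

    /** Reads up to the next '\n' only, so stray '\r' characters stay in the record for clean(). */
    static String readRecord(BufferedReader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n') {
                return sb.toString();
            }
            sb.append((char) c);
        }
        return sb.length() > 0 ? sb.toString() : null; // null signals end of file
    }

    /** Writes source_1, source_2, ... next to the source file; returns the number of files. */
    public static int split(Path source, int recordsPerFile) throws IOException {
        int fileCount = 0;
        int recordsInCurrent = 0;
        BufferedWriter out = null;
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.ISO_8859_1)) {
            readRecord(in); // header line: read and not written, as in the Perl script
            String line;
            while ((line = readRecord(in)) != null) {
                if (out == null || recordsInCurrent >= recordsPerFile) {
                    if (out != null) {
                        out.close();
                    }
                    fileCount++;
                    recordsInCurrent = 0;
                    Path target = Paths.get(source.toString() + "_" + fileCount);
                    out = Files.newBufferedWriter(target, StandardCharsets.US_ASCII);
                }
                out.write(clean(line));
                out.write('\n');
                recordsInCurrent++;
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return fileCount;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: java LineCountSplitter <file>");
            return;
        }
        split(Paths.get(args[0]), 10_000_000); // same record limit as the Perl script
    }
}
```

The record limit is a parameter, so a smaller value gives smaller files without touching the rest of the code.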

  6. #6
    jim829

    It looks to me like your inner while loop keeps reading the same file until it has written 10,000,000 records, so the script creates multiple files of 10,000,000 records each. So do you want to duplicate this in Java, or break each file down into smaller pieces? Or am I missing something in the Perl script and your requirements?

    Regards,
    Jim


    Perl Code:
    #!/usr/bin/perl -w
    $file=$ARGV[0];
     
    if(!$file) {
      print "$0 <file> \n";
      exit(1);
    }
      
    open($fh, "< $file") or die "Unable to open $file";
    $headers=<$fh>;  # Read
     
    $rc=0;
    $file_cnt=1;
    
    $file_out = $file . "_" . $file_cnt;
     
    while($line=<$fh>) {
       open($fho, "> $file_out") or die "Unable to open $file_out";
       while ($rc < 10000000) {
            $line=~s/\015//g;
            $line=~s/[^[:ascii:]]//g;
            $rc++;
            print $fho $line;
            $line=<$fh>;
            if (eof($fh))  {
               close ($fh);
               close ($fho);
               exit(0);
           }
       }
       close ($fho);
       $rc = 0;
      
       $file_cnt= $file_cnt + 1;
       $file_out = $file . "_" . $file_cnt;
    } 
    close ($fh);
    close ($fho);
    exit(0);
    The JavaTM Tutorials | SSCCE | Java Naming Conventions
    Poor planning on your part does not constitute an emergency on my part

  7. #7
    ron

    Hi Jim,

    I would like to duplicate this in Java.

    Thanks,

    Ron

  8. #8
    jim829

    But this program does not do what you had asked. Here is the inner loop.

    Perl Code:
    while ($rc < 10000000) {
        $line=~s/\015//g;
        $line=~s/[^[:ascii:]]//g;
        $rc++;
        print $fho $line;
        $line=<$fh>;
        if (eof($fh))  {
               close ($fh);
               close ($fho);
               exit(0);
        }
    }
    It creates files of 10,000,000 records each. Then the outer loop (not shown) increments the file count. You said you wanted to take a single 10,000,000-record file and convert it into multiple 5MB files. In any event I will go by what you asked for and not what this script does, so you can take the advice given by the others on this thread. I am not certain why a CSV file would contain non-ASCII characters. If they were Unicode or some other encoding (e.g. EBCDIC, ugh) I would think they would be an essential part of the data and you would want to retain them. But that could be ignorance on my part.

    Regards,
    Jim
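
To make the point about non-ASCII characters concrete, here is a small sketch of what that kind of substitution does to accented data. The class and method names are made up for the example, and the pattern is a Java equivalent of the Perl character class, assuming it is applied to already-decoded text.

```java
// Hypothetical demo: stripping non-ASCII characters silently changes the data.
public class NonAsciiDemo {

    /** Removes every character outside the 7-bit ASCII range. */
    static String stripNonAscii(String s) {
        return s.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute, \u00fc is u-umlaut
        String record = "Caf\u00e9,Z\u00fcrich,42";
        System.out.println(stripNonAscii(record)); // prints Caf,Zrich,42
    }
}
```

If the accented characters matter, it is probably safer to drop this step and only remove the carriage returns.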

  9. #9
    ron

    Jim,

    I am trying out a concept where I have either a large CSV or delimited file, and I split it into manageable chunks to allow parallel processing by Amazon Redshift as it loads the data into the database table. I got this code from a book where someone was using it to split a large file into multiple small files while removing extra carriage returns (015) as well as non-ASCII characters.

    Thanks,

    Ron
