#1 atomant (Member)

    Searching Through Large Amounts of Data

    I have a CSV file with 177 million rows and 6 columns. I want to search the first column, which is in numeric order, and retrieve the matching row. I tried SQL, but importing the data would take far too long, and I can't load the CSV into an array in Java because it runs out of memory. There isn't a way to read from a particular line number of a file without going through every line before it, is there? Any suggestions on the quickest way to search through this data?

#2 Norm (Moderator)

    Is the file sorted on the data in the column you are searching? If so, you could use a RandomAccessFile to do a binary search (of sorts). If you can start reading the file at any point and find the start of a record, based on the location of a line-end character, you should be able to use a binary search technique: start in the middle, find the start of the next record, test its value, and choose the next read point based on whether that value is greater or less than the one you want.
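
    For illustration, here is a minimal sketch of that technique, assuming the first field is a plain decimal long, records end with a newline, and the file is sorted ascending on that field (the class and method names are made up):

    Java Code:
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class CsvBinarySearch {

        // Binary search over byte offsets in a CSV file that is sorted
        // ascending on its numeric first column. Returns the matching
        // line, or null if the key is not present.
        public static String search(RandomAccessFile raf, long target) throws IOException {
            long lo = 0;             // the target line starts at or after lo
            long hi = raf.length();  // ... and before hi
            while (lo < hi) {
                long mid = lo + (hi - lo) / 2;
                // Align to the first record boundary at or after mid:
                // seeking to mid - 1 ensures a record starting exactly
                // at mid is not skipped by the discarding readLine().
                if (mid == 0) {
                    raf.seek(0);
                } else {
                    raf.seek(mid - 1);
                    raf.readLine(); // discard the tail of the current line
                }
                String line = raf.readLine();
                if (line == null) { // ran past the last record
                    hi = mid;
                    continue;
                }
                long key = Long.parseLong(line.substring(0, line.indexOf(',')));
                if (key == target) {
                    return line;
                } else if (key < target) {
                    lo = raf.getFilePointer(); // keep looking after this record
                } else {
                    hi = mid;                  // keep looking before this offset
                }
            }
            return null;
        }
    }

    A call such as search(new RandomAccessFile("data.csv", "r"), someKey) touches only about log2(177 million), i.e. roughly 28, records per lookup instead of scanning the whole file.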

#3 atomant (Member)

    Yes, the data is sorted. Being able to do a binary search would be perfect; I was under the impression you could only read a file starting from line 0. I'll look into RandomAccessFile. Thanks.

#4 Norm (Moderator)

    The trick will be finding the beginning of a line/record when you start reading in the middle of the file. The line-end character should allow you to do that.
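
    As a self-contained sketch of that alignment step: seek to one byte before the chosen position and discard the remainder of that line. The pos - 1 detail matters, because seeking exactly to pos and discarding a line would skip a record that happens to begin right at pos:

    Java Code:
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class LineAlign {

        // Returns the byte offset of the first record that starts at or
        // after pos, using the line-end character to find the boundary.
        public static long firstRecordAtOrAfter(RandomAccessFile raf, long pos)
                throws IOException {
            if (pos == 0) {
                return 0; // the start of the file is always a record start
            }
            raf.seek(pos - 1);
            raf.readLine(); // consume the rest of the line containing pos - 1
            return raf.getFilePointer(); // now at the next record's first byte
        }
    }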

#5 atomant (Member)

    I got it working, but it's taking too long. The CSV file is 4 GB and I have to pull over 1 million rows.

    If I can put my data into two columns, each a 10-12 digit long, could I use a HashMap? I'm not really sure whether this would work; I've never worked with HashMaps.
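
    For what it's worth, a minimal sketch of that idea, assuming a two-column CSV in which both fields parse as long (the file path and names are placeholders). The entire map has to fit in the heap, so this can only work for data sets much smaller than the 177-million-row file:

    Java Code:
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class CsvToMap {

        // Reads a two-column CSV of longs into a HashMap so that each
        // subsequent lookup is a single O(1) map.get(key) call.
        public static Map<Long, Long> load(String path) throws IOException {
            Map<Long, Long> map = new HashMap<Long, Long>();
            BufferedReader in = new BufferedReader(new FileReader(path));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    int comma = line.indexOf(',');
                    map.put(Long.valueOf(line.substring(0, comma)),
                            Long.valueOf(line.substring(comma + 1)));
                }
            } finally {
                in.close();
            }
            return map;
        }
    }

    Once loaded, a lookup is just Long value = map.get(10111213141516L); (null when the key is absent).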

#6 atomant (Member)

    I think I got it working with HashMaps. It's going very fast now.

#7 doWhile (Moderator)

    Quote Originally Posted by atomant: I have tried SQL and importing it would take way too long
    How are you importing? I hope not every time you need the data. You should only have to do this once; yes, that may take a while, but once the data is loaded and indexed properly, queries should be sufficiently fast.
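
    For example, the one-time indexing step might look like the following (table and column names are hypothetical, MySQL syntax assumed); without an index on the searched column, every SELECT scans the full table:

    Java Code:
    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class AddIndex {

        // One-time DDL to run after the bulk import finishes. A B-tree
        // index turns each lookup into a few page reads instead of a
        // scan over all 177 million rows.
        public static void addIndex(Connection connection) throws SQLException {
            Statement st = connection.createStatement();
            try {
                st.execute("CREATE INDEX idx_deckForm ON ranks (deckForm)");
            } finally {
                st.close();
            }
        }
    }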

#8 atomant (Member)

    Yes, I was only doing it once. I could never figure out why it was going so slow. I tried inserting one row at a time, and even using PreparedStatements and batch inserts, but building the table would have taken about a month for some reason (2,000 rows/min). When I tried importing from CSV I kept getting memory errors.

#9 Norm (Moderator)

    Strange that you are able to use a HashMap successfully. How do you keep from reading the whole file and running into a memory problem?

#10 doWhile (Moderator)

    Quote Originally Posted by atomant:
    Yes, I was only doing it once. I could never figure out why it was going so slow. I tried inserting one row at a time, and even using PreparedStatements and batch inserts, but building the table would have taken about a month for some reason (2,000 rows/min). When I tried importing from CSV I kept getting memory errors.
    Given you have it working, the point is somewhat moot, but did you try pulling the CSV directly into the SQL database, rather than going through JDBC as an intermediary (or, if your database does not accept CSV, converting the file into a format it recognizes)?

#11 Tolls (Moderator)

    Of course, we don't actually know what this is for.
    How many times does this one thing have to be searched?
    Is it just a single row from the 177 million in the CSV (which begs the question of 'why')?
    It's all just a bit vague...

#12 atomant (Member)

    Quote Originally Posted by Norm:
    Strange that you are able to use a HashMap successfully. How do you keep from reading the whole file and running into a memory problem?
    Yeah, so I spoke too soon on that, I guess. I have two databases: one with 2.5 million rows (which I need to search about 150 million times), and another with 177 million rows (which I would only need to search about 1.7 million times). Either way will be equivalent. I was able to load the smaller 2.5-million-row database into a HashMap, and when I tried a loop that looked up the same long 200 million times, it finished in about 2 seconds. But when I wrote a loop that looked up a different long each time, it took over a minute.

    Trying to load the 177-million-row database returned an out-of-memory error, as you thought it would.

    My solution probably lies in finding a way to cut down the number of searches I'm making (I have no idea how that would be possible), but I would still like to figure out the best way to search through this data, because I think it can be done much faster than I'm doing now.

    Quote Originally Posted by doWhile:
    Given you have it working, the point is somewhat moot, but did you try pulling the CSV directly into the SQL database, rather than going through JDBC as an intermediary (or, if your database does not accept CSV, converting the file into a format it recognizes)?
    I used phpMyAdmin to import the CSV/ZIP file; first I tried the whole file, then I broke it into chunks. I also tried some scripts that were supposed to help get past the file-size/memory limits, but I couldn't get any of them to work, and I couldn't get hand-written SQL import commands working either. I'm brand new to SQL, so I'm not sure how to pull the CSV directly into the database or how to convert it into SQL. Here is a post I made about the importing problem.

    Quote Originally Posted by Tolls:
    Of course, we don't actually know what this is for.
    How many times does this one thing have to be searched?
    Is it just a single row from the 177 million in the CSV (which begs the question of 'why')?
    It's all just a bit vague...
    For the 2.5-million-row database, each row will be searched multiple times at different times; I haven't been able to think of a way around this, unfortunately. For the 177-million-row database, each of the 1.7 million searches would be unique.

#13 doWhile (Moderator)

    Quote Originally Posted by atomant: I used phpMyAdmin to import the CSV/ZIP file
    Doing it through PHP doesn't actually remove the overhead of driving SQL from another language. Since you've indicated you are using MySQL, you can go straight to the mysql command-line utility and use the LOAD DATA INFILE command:
    SQL Code:
    LOAD DATA INFILE 'myfilepath' INTO TABLE mytable FIELDS TERMINATED BY ',';
    This cuts out the overhead of communicating with SQL from another language (be sure to index the column(s) properly), but you need to be careful that the table and CSV file structures match (test with a few lines first to work out any problems)... You should also keep database normalization in mind: you've given no indication of what the file contains, but when there are many duplicate values, the redundancy can often be trimmed down considerably through normalization. An alternative (and crude) solution would be to create your own file index: read the file once and store each first-column value along with its byte offset in the file. Then, when you need a given line, open a RandomAccessFile, set its position to the byte offset, and read.
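
    A rough sketch of that do-it-yourself index (numeric first column and comma delimiter are assumptions; the names are made up). One caveat: with 177 million keys, the index map itself may exhaust the heap, so at that scale the offsets would need a primitive-keyed or on-disk structure:

    Java Code:
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.Map;

    public class FileIndex {

        // Pass 1: map each first-column key to the byte offset of its line.
        public static Map<Long, Long> buildIndex(String path) throws IOException {
            Map<Long, Long> index = new HashMap<Long, Long>();
            RandomAccessFile raf = new RandomAccessFile(path, "r");
            try {
                long offset = 0;
                String line;
                while ((line = raf.readLine()) != null) {
                    long key = Long.parseLong(line.substring(0, line.indexOf(',')));
                    index.put(key, offset);
                    offset = raf.getFilePointer(); // start of the next line
                }
            } finally {
                raf.close();
            }
            return index;
        }

        // Lookup: jump straight to the stored offset and read one line.
        public static String fetch(RandomAccessFile raf, Map<Long, Long> index, long key)
                throws IOException {
            Long offset = index.get(key);
            if (offset == null) {
                return null; // key not present in the file
            }
            raf.seek(offset);
            return raf.readLine();
        }
    }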

#14 atomant (Member)

    That SQL command worked great, thanks.

    Here is the test code I have for searching through the database; is there a way to do it faster?

    Java Code:
    public static void searchSQL() {
        String driverName = "org.gjt.mm.mysql.Driver";
        Connection connection = null;
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace(); // JDBC driver is not on the classpath
        }

        String serverName = "localhost";
        String mydatabase = "PokerRanks4";
        String url = "jdbc:mysql://" + serverName + "/" + mydatabase;
        String username = "root";
        String password = "";

        try {
            connection = DriverManager.getConnection(url, username, password);
        } catch (SQLException e) {
            e.printStackTrace(); // could not connect to the database
        }

        /*
        /////////////// Option 1: PreparedStatement
        System.out.println("GO!");
        ResultSet rs = null;
        PreparedStatement pstmt = null;
        String query = "SELECT rank FROM ranks WHERE deckForm = ?";
        long start = System.currentTimeMillis();
        try {
            pstmt = connection.prepareStatement(query);
            for (int i = 0; i < 1800000; i++) {
                pstmt.setLong(1, 10111213141516L);
                rs = pstmt.executeQuery();
                while (rs.next()) {
                    long num = rs.getLong(1);
                    //System.out.println(num);
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
        System.out.println("Total Time: " + (System.currentTimeMillis() - start) / 1000);
        ///////////////
        */

        /////////////// Option 2: plain Statement
        Statement st = null;
        long start = System.currentTimeMillis();
        try {
            st = connection.createStatement();
            ResultSet rs = null;
            long deckForm = 1719213456L;
            for (int i = 0; i < 100000; i++) {
                rs = st.executeQuery("SELECT rank FROM ranks WHERE deckForm = " + deckForm);
                while (rs.next()) {
                    long num = rs.getLong(1);
                    //System.out.println(num);
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
        System.out.println("Total Time: " + (System.currentTimeMillis() - start) / 1000);
        ///////////////

        try {
            connection.close();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

#15 doWhile (Moderator)

    Quote Originally Posted by atomant: That SQL command worked great, thanks.
    You are welcome.

    Quote Originally Posted by atomant: is there a way to do it faster?
    Why? Is it slow? How slow? And you should follow the advice given in your other post, lest people become shy about giving out further advice for fear it will not be taken: Java + SQL

#16 atomant (Member)

    It is taking well over a minute to run through the queries, and some runs take longer. In my tests I am trying to run 180 million selects of the same row on the 2.5-million-row database, and 1.7 million selects of the same row on the 133-million-row database.

    Which advice did I miss? I used PreparedStatements and added try/catch.

#17 doWhile (Moderator)

    Over a minute to run all those queries, a single query, or just the method above? Even with the method above, you are running 100,000 queries... it's going to take some time. FWIW, it's best practice to close the ResultSet and Statement objects (which may or may not hold resources, depending on your database), and to do the closing in a finally clause.
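
    A sketch of that pattern applied to the loop above: one PreparedStatement reused for every lookup, each ResultSet closed as soon as it is consumed, and the statement released in a finally block (the keys array stands in for however the deckForm values are actually produced):

    Java Code:
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class LookupRanks {

        // Runs one indexed SELECT per key, reusing a single PreparedStatement.
        public static void lookUp(Connection connection, long[] keys) throws SQLException {
            PreparedStatement pstmt =
                    connection.prepareStatement("SELECT rank FROM ranks WHERE deckForm = ?");
            try {
                for (long deckForm : keys) {
                    pstmt.setLong(1, deckForm);
                    ResultSet rs = pstmt.executeQuery();
                    try {
                        while (rs.next()) {
                            long rank = rs.getLong(1); // use the value here
                        }
                    } finally {
                        rs.close(); // free per-query resources promptly
                    }
                }
            } finally {
                pstmt.close();
            }
        }
    }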

#18 Tolls (Moderator)

    Quote Originally Posted by doWhile:
    This cuts out the overhead of communicating with SQL from another language (be sure to index the column(s) properly)...
    But only after you've loaded the data.
    :)

    Quote Originally Posted by atomant:
    That SQL command worked great, thanks.

    Here is the test code I have for searching through the database; is there a way to do it faster?
    Yes: use the PreparedStatement version of the code, which appears to be commented out.
    Other than that, it could be, as doWhile suggests, that your database structure needs tweaking.

