  1. #1
    sixsigma1978 is offline Member
    Join Date
    May 2011
    Posts
    4
    Rep Power
    0

    Default removing duplicate strings from a massive array efficiently?

    I'm considering the best possible way to remove duplicates from an (unsorted) array of strings - the array contains millions or tens of millions of strings.

    What would be the best way to achieve this? I can think of using a HashSet (or a LinkedHashSet to keep the order) - but if I have to do it programmatically,
    what would be a good approach?

    I was thinking along the lines of doing a sort followed by binary search, to get an O(log n) lookup instead of an O(n) linear one, but I haven't been able to put the code together.

    Please help! I'm looking for an answer that addresses both speed and memory, since there are millions of strings involved.

  2. #2
    Norm is online now Moderator
    Join Date
    Jun 2008
    Location
    Eastern Florida
    Posts
    17,812
    Rep Power
    25

    Default Re: removing duplicate strings from a massive array efficiently?

    How are the Strings loaded into the array? Can dups be detected at loading time?
    If you don't understand my response, don't ignore it, ask a question.

  3. #3
    Diargg is offline Senior Member
    Join Date
    Feb 2012
    Posts
    117
    Rep Power
    0

    Default Re: removing duplicate strings from a massive array efficiently?

    In addition to Norm's question, how sparse would the array be after removing dupes? I.e., are the majority dupes, or is the dupe rate more like 1 in 50(00)? Load-time detection would work great if you have a ton of dupes - the collection you're checking against never grows too large. Even if it does keep growing, you still cut your computations at least in half.
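A minimal sketch of this load-time check (the sample data below is made up; in practice the strings would come from whatever source populates the array):

```java
import java.util.HashSet;
import java.util.Set;

public class LoadTimeDedup {
    public static void main(String[] args) {
        // Hypothetical incoming strings, standing in for the real source.
        String[] incoming = {"a", "b", "a", "c", "b"};

        // add() returns false when the string has been seen before,
        // so duplicates are rejected as they arrive.
        Set<String> seen = new HashSet<>();
        for (String s : incoming) {
            if (seen.add(s)) {
                // s is new - append it to the real array/list here
            }
        }
        System.out.println(seen.size()); // 3 unique strings
    }
}
```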

  4. #4
    sixsigma1978 is offline Member
    Join Date
    May 2011
    Posts
    4
    Rep Power
    0

    Default Re: removing duplicate strings from a massive array efficiently?

    Quote Originally Posted by Diargg View Post
    In addition to Norm's question, how sparse would the array be after removing dupes? ie are the majority dupes? Or is the dupe rate perhaps 1 in 50(00)? Load-time detection would work great if you have a ton of dupes - the array you're checking against never grows too large. If it typically is growing, then you still cut your computations in half at least.
    The array is already prepopulated, so no chance of fixing dups at load time. The duplicate rate is pretty low - I'd say an array with 10 million rows and about 5000 dups.
    Last edited by sixsigma1978; 04-05-2012 at 12:15 AM.

  5. #5
    Norm is online now Moderator
    Join Date
    Jun 2008
    Location
    Eastern Florida
    Posts
    17,812
    Rep Power
    25

    Default Re: removing duplicate strings from a massive array efficiently?

    I'm curious: how does an array get pre-populated? Is it full when the program first starts execution? Are the lines in the source code and compiled into the program?

  6. #6
    Daslee is offline Member
    Join Date
    Mar 2012
    Location
    Plunge, Lithuania
    Posts
    36
    Rep Power
    0

    Default Re: removing duplicate strings from a massive array efficiently?

    To remove duplicates you can use this code:

    Java Code:
    import java.util.ArrayList;
    import java.util.List;
    
    List<String> arraylist = new ArrayList<>(); // this is your main list
    List<String> newarraylist = new ArrayList<>();
    
    for (String s : arraylist) {
    	if (!newarraylist.contains(s)) { // contains() scans the whole list
    		newarraylist.add(s);
    	}
    }
    
    arraylist = newarraylist;

  7. #7
    Norm is online now Moderator
    Join Date
    Jun 2008
    Location
    Eastern Florida
    Posts
    17,812
    Rep Power
    25

    Default Re: removing duplicate strings from a massive array efficiently?

    Another idea: Use a HashSet. The add() method will tell you if the item is already in the HashSet.
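A short sketch of this HashSet idea, using a LinkedHashSet in case the original order matters (the sample data is made up):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class HashSetDedup {
    public static void main(String[] args) {
        String[] input = {"x", "y", "x", "z", "y"};

        // LinkedHashSet keeps first-seen order; a plain HashSet is
        // slightly cheaper if order does not matter.
        Set<String> unique = new LinkedHashSet<>();
        for (String s : input) {
            unique.add(s); // returns false for a duplicate
        }

        String[] deduped = unique.toArray(new String[0]);
        System.out.println(String.join(",", deduped)); // x,y,z
    }
}
```

Each add() is O(1) on average, so the whole pass is O(n), at the cost of extra memory for the set.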

  8. #8
    sixsigma1978 is offline Member
    Join Date
    May 2011
    Posts
    4
    Rep Power
    0

    Default Re: removing duplicate strings from a massive array efficiently?

    Quote Originally Posted by Norm View Post
    Another idea: Use a HashSet. The add() method will tell you if the item is already in the HashSet.
    True - I know the LinkedHashSet class will do the trick, and it works. It will preserve the order.
    I'm more curious to see if there is a programmatic way that's equally or more efficient without using the Collections API.

    One approach I was thinking of is sorting the array and then checking neighboring entries for duplicate strings.
    I'm trying to avoid the O(n^2) cost of a brute-force pairwise comparison. I think by sorting and then comparing adjacent entries I'll get O(n log n) overall, which is still better.

    The goal I'm trying to reach is the ideal O(n) - which is naturally the fastest.
    Last edited by sixsigma1978; 04-06-2012 at 01:31 AM.
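The sort-and-scan idea described above could look like this (sample data made up; note that sorting destroys the original order, unlike the LinkedHashSet approach):

```java
import java.util.Arrays;

public class SortDedup {
    public static void main(String[] args) {
        String[] a = {"pear", "apple", "pear", "fig", "apple"};

        Arrays.sort(a); // O(n log n) - duplicates become neighbors

        // Single O(n) pass: compact unique entries to the front in place,
        // so no second 10-million-slot array is needed.
        int w = 0; // write index for unique entries
        for (int r = 0; r < a.length; r++) {
            if (r == 0 || !a[r].equals(a[w - 1])) {
                a[w++] = a[r];
            }
        }

        String[] unique = Arrays.copyOf(a, w);
        System.out.println(Arrays.toString(unique)); // [apple, fig, pear]
    }
}
```

Overall cost is O(n log n) time with almost no extra memory beyond the sort, whereas the HashSet approach is O(n) time but needs extra memory proportional to the number of unique strings.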

