- 08-13-2011, 07:29 PM #1
Member
- Join Date
- Jan 2011
- Posts
- 71
- Rep Power
- 0
Problem with Scanner - using delimiters
Hello folks
First, this isn't a homework project and in fact is just a pet project of mine. Problem I have is as follows:
I have a large email list which has been provided to me by a third party. The third party doesn't have any validation on their email field, so the end users can input any old rubbish. The data has been supplied to me in a *.csv file. Here are the steps:
1. Remove duplicates. Doddle, just read the *.csv file into a HashSet.
2. Fix syntax errors. This is where the issues lie.
I have examples of emails that are as follows:
abc@somewhere.com
abc@somewhereelse,com
The first example is the happy path and I can deal with that. The second is where my problem lies. I'm already using "," as the delimiter, so when populating the HashSet the second example gives me "abc@somewhereelse". With the large array of top-level domains out there, I can't see how I can get the full email into the set so that I can then correct it (substitute the comma with a full stop). Any ideas? Is there any way I can implement an escape of the comma, building back from the domain, but not on the comma at the end of each entry? Note, there may be more than one comma in each email, but I have a plan to deal with those.
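For anyone following along, the splitting described above can be reproduced with a small sketch. The class and method names here (`CommaDelimiterDemo`, `tokens`) are made up for illustration, and the input is just the two sample addresses joined by a comma rather than the real file:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class, not the original poster's code.
class CommaDelimiterDemo {

    // Splits the input on every comma, the same way a Scanner
    // configured with useDelimiter(",") would.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            sc.useDelimiter(",");
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The typo'd address is cut in two at its comma.
        System.out.println(tokens("abc@somewhere.com,abc@somewhereelse,com"));
    }
}
```

The second address comes back as two separate tokens, "abc@somewhereelse" and "com", so the comma that needs correcting is already gone by the time the entry reaches the set.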
Just to be clear, it's obvious from looking at the email addresses that the second example is nothing more than a typo. There are other entries in the file that are clearly nonsense and they will be dropped.
Any advice would be appreciated.
Thanks
- 08-15-2011, 11:26 AM #2
Senior Member
- Join Date
- Jun 2008
- Posts
- 2,366
- Rep Power
- 7
Google for a CSV library (there are a number of them out there) rather than using Scanner. Also, hopefully, the CSV is valid (i.e. that email containing the comma is hopefully surrounded by quotes), otherwise there is nothing you can do anyway.
- 08-15-2011, 07:49 PM #3
Member
- Join Date
- Jan 2011
- Posts
- 71
- Rep Power
- 0
Hi masijade
Thanks for the reply. I looked further into the data provided and managed to get round the issue by using the CRLF as the delimiter instead. The person that created the files obviously didn't think too hard about the quality, which in a weird way helped me out. Anyhoo, the good news is that the code worked, and out of 32k records they only have 200 with quality issues (the commas being the least of their worries, and I managed to fix most of them). A change request has been issued for validation on the front end. Happy days.
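A minimal sketch of the line-based approach, for reference. This is not the actual code used. It assumes one entry per line, possibly followed by a trailing comma, and it treats any comma after the "@" as a typo for a full stop. The names `EmailListCleaner` and `clean` are hypothetical, and a LinkedHashSet stands in for the HashSet only so the output order is predictable:

```java
import java.util.LinkedHashSet;
import java.util.Scanner;
import java.util.Set;

// Hypothetical helper, sketched under the assumptions stated above.
class EmailListCleaner {

    static Set<String> clean(String fileContents) {
        Set<String> emails = new LinkedHashSet<>();
        try (Scanner sc = new Scanner(fileContents)) {
            // Split on line endings (CRLF or bare LF) instead of commas.
            sc.useDelimiter("\\r?\\n");
            while (sc.hasNext()) {
                String entry = sc.next().trim();
                // Assumption: a comma at the end of the line is the field
                // separator, not part of the address, so strip it.
                while (entry.endsWith(",")) {
                    entry = entry.substring(0, entry.length() - 1).trim();
                }
                if (entry.isEmpty()) {
                    continue; // skip blank lines
                }
                // Assumption: commas in the domain part are typos for '.'.
                int at = entry.indexOf('@');
                if (at >= 0) {
                    entry = entry.substring(0, at)
                            + entry.substring(at).replace(',', '.');
                }
                emails.add(entry); // the set drops duplicates
            }
        }
        return emails;
    }

    public static void main(String[] args) {
        String sample = "abc@somewhere.com,\r\nabc@somewhereelse,com,\r\nabc@somewhere.com,\r\n";
        System.out.println(clean(sample));
    }
}
```

Because the whole line is kept as a single token, the stray comma is still there to be corrected before the entry goes into the set. Entries that are plain nonsense would still need a separate check before being dropped.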