Building a tokenizer
I'm trying to build a tokenizer to break sentences up into words. Since splitting the string on whitespace alone isn't sufficient, I'd like to give it a few more delimiters.
Whitespace is one of them, but also a dot followed by whitespace, a question mark followed by whitespace, etc. (but not a dot alone, since that can be part of an abbreviation, for example). So I actually want to split on a mix of single characters (whitespace) and strings (a punctuation mark followed by whitespace).
So conceptually I'd figure it looks something like this:
input.split("[", ", ". ", "? ", "! , " "]";
But Eclipse doesn't really like that...
I've read the docs for String and Pattern/Matcher, but they don't really help me.
Could anybody here point me in the right direction?
I needed to do this when I wrote a compiler. I found that Java's regex package (java.util.regex) worked just fine, since I had a mix of delimiters with and without whitespace but needed regular tokens either way.
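For what it's worth, here is a minimal sketch of how the regex approach might look for the delimiters described in the question. The class name SimpleTokenizer and the method tokenize are made up for illustration, and the pattern is only my reading of the requirement: an optional '.', '?', '!' or ',' followed by one or more whitespace characters counts as a delimiter, while a dot with no whitespace after it is left alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of any library.
final class SimpleTokenizer {
    // Assumed delimiter: an optional '.', '?', '!' or ',' followed by
    // one or more whitespace characters. A dot that is not followed by
    // whitespace (as in "v1.2") does not match and stays in the token.
    private static final Pattern DELIMITER = Pattern.compile("[.?!,]?\\s+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        // Trim first so leading whitespace does not yield an empty first token.
        for (String token : DELIMITER.split(input.trim())) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hi there. How are you? Fine, thanks!"));
        // prints [Hi, there, How, are, you, Fine, thanks!]
    }
}
```

Note that String.split takes a single regular expression rather than a list of delimiters, which is probably why the attempt in the question doesn't compile. Also, with this pattern an abbreviation followed by a space (such as "Dr. Smith") is still split and loses its dot, and punctuation at the very end of the input stays attached to the last token; both cases would need extra handling.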