how to preserve whitespaces etc when tokenizing stream?
I am trying to perform a "translation" of sorts on a stream of text. More specifically, I need to tokenize the input stream, look up every term in a specialized dictionary, and output the corresponding "translation" of each token. However, I also want to preserve all the original whitespace, stopwords, etc. from the input, so that the output is formatted the same way as the input instead of ending up as a bare stream of translations. So if my input is
<term1>: <term2> <stopword>! <term3>
<term4>
then I want the output to look like
<term1'>: <term2'> <stopword>! <term3'>
<term4'>
(where <termi'> is the translation of <termi>) instead of
<term1'> <term2'> <term3'> <term4'>
Currently I am doing the following:
Code:
PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
        PatternAnalyzer.WHITESPACE_PATTERN,
        false,
        WordlistLoader.getWordSet(new File(stopWordFilePath)));
TokenStream ts = pa.tokenStream(null, in);
CharTermAttribute charTermAttribute = ts.getAttribute(CharTermAttribute.class);
while (ts.incrementToken()) { // loop over tokens
    String termIn = charTermAttribute.toString();
    ...
}
But this, of course, loses all the whitespace etc. How can I modify this so that I can re-insert them into the output? Thanks much!
Re: how to preserve whitespaces etc when tokenizing stream?
Can't you just tokenize on something other than whitespace?
Re: how to preserve whitespaces etc when tokenizing stream?
But in that case, wouldn't the tokens include all the whitespace etc., which would break their lookup in the dictionary?
Re: how to preserve whitespaces etc when tokenizing stream?
Quote, originally posted by ilyaz:
But in this case wouldn't the tokens include all the whitespaces etc screwing up their proper look up in the dictionary?
Well, you could do it in two steps: split the text into words while keeping the whitespace, then do the replacement on the words, leaving the whitespace untouched.
You might be over-complicating things though. For your purposes, wouldn't a simple String.replace() work?
Re: how to preserve whitespaces etc when tokenizing stream?
Quote, originally posted by KevinWorkman:
You might be over-complicating things though. For your purposes, wouldn't a simple String.replace() work?
I can't use replace() since I don't know beforehand what I will be replacing with what; I need to separate out every "word" and then look it up. So this boils down to somehow splitting the original string into a sequence of tokens like this
{ term, separator, term, separator, ... }
where separators would be defined as runs of characters that are "not Unicode letters". Do you know:
1. How do I specify the right regex for "not a Unicode letter"?
2. What can I use to split like this? The regular String.split() loses the separators, just like the tokenizer I used above.
Thanks
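For reference, one possible way to get the { term, separator, term, separator, ... } sequence asked about above is to match tokens with java.util.regex instead of splitting on them. In Java regexes, \p{L} matches any Unicode letter and \P{L} matches anything that is not one. The sketch below is only an illustration: the class and method names (TermSplitter, splitKeepingSeparators, isTerm) are made up for this example, and it assumes the whole input fits in a String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or the JDK.
class TermSplitter {
    // Either a run of Unicode letters (a term) or a run of anything else (a separator).
    private static final Pattern TOKEN = Pattern.compile("\\p{L}+|\\P{L}+");

    // Returns terms and separators in input order.
    // Concatenating the returned tokens reproduces the input exactly.
    static List<String> splitKeepingSeparators(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // A token is a term if it starts with a letter; otherwise it is a separator.
    static boolean isTerm(String token) {
        return !token.isEmpty() && Character.isLetter(token.codePointAt(0));
    }
}
```

Each token for which isTerm returns true can be looked up in the dictionary; separators are written to the output unchanged.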
Re: how to preserve whitespaces etc when tokenizing stream?
Like I said, split it up into two steps. Store your big String in some variable. Also store a Set of all the words in that big String, without whitespace. Then do a replaceAll on the big String for each word in the Set, replacing it with its translation. If you're talking about a String that's too big to fit in memory, then do it in chunks.
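A possible single-pass variant of this idea is sketched below. Instead of calling replaceAll once per word, which can also hit substrings of longer words or re-translate text that was already translated, it walks the text once with Matcher.appendReplacement and looks up each run of letters as it goes. Everything between the terms is copied through untouched. The names StreamTranslator and translate are hypothetical, and the dictionary is assumed to be a plain Map<String, String>; terms without an entry (stopwords, for example) are left as they are.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or the JDK.
class StreamTranslator {
    // A term is assumed to be a run of Unicode letters.
    private static final Pattern TERM = Pattern.compile("\\p{L}+");

    // Replaces every term that has a dictionary entry and copies
    // whitespace, punctuation and unknown terms through unchanged.
    static String translate(String text, Map<String, String> dictionary) {
        Matcher m = TERM.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String term = m.group();
            String translated = dictionary.getOrDefault(term, term);
            // quoteReplacement stops '$' and '\' in a translation
            // from being interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(translated));
        }
        m.appendTail(out); // copy whatever follows the last term
        return out.toString();
    }
}
```

For input that is too large to hold in memory, the same method could be applied line by line, since a run of letters never spans a line break.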