- 01-13-2012, 05:42 PM #1
Member
- Join Date
- Jun 2011
- Posts
- 10
- Rep Power
- 0
how to preserve whitespaces etc when tokenizing stream?
I am trying to perform a "translation" of sorts on a stream of text. More specifically, I need to tokenize the input stream, look up every term in a specialized dictionary and output the corresponding "translation" of the token. However, I also want to preserve all the original whitespace, stopwords etc. from the input, so that the output is formatted the same way as the input instead of ending up as a bare stream of translations. So if my input is
<term1>: <term2> <stopword>! <term3>
<term4>
then I want the output to look like
<term1'>: <term2'> <stopword>! <term3'>
<term4'>
(where <termi'> is the translation of <termi>) instead of
<term1'> <term2'> <term3'> <term4'>
Currently I am doing the following:
Java Code:
PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31, PatternAnalyzer.WHITESPACE_PATTERN, false,
        WordlistLoader.getWordSet(new File(stopWordFilePath)));
TokenStream ts = pa.tokenStream(null, in);
CharTermAttribute charTermAttribute = ts.getAttribute(CharTermAttribute.class);
while (ts.incrementToken()) { // loop over tokens
    String termIn = charTermAttribute.toString();
    ...
}
but this, of course, loses all the whitespace etc. How can I modify this so that I can re-insert it into the output? Thanks much!
- 01-13-2012, 06:23 PM #2
Re: how to preserve whitespaces etc when tokenizing stream?
Can't you just tokenize on something other than whitespace?
How to Ask Questions the Smart Way
Static Void Games - Play indie games, learn from game tutorials and source code, upload your own games!
- 01-13-2012, 06:27 PM #3
Re: how to preserve whitespaces etc when tokenizing stream?
But in that case, wouldn't the tokens include all the whitespace etc., which would break their lookup in the dictionary?
- 01-13-2012, 07:11 PM #4
Re: how to preserve whitespaces etc when tokenizing stream?
- 01-13-2012, 07:55 PM #5
Re: how to preserve whitespaces etc when tokenizing stream?
I can't use replace() since I don't know beforehand what I will be replacing with what. I need to separate out every "word" and then look it up. So this boils down to somehow splitting the original string into a sequence of tokens like this
{ term, separator, term, separator, ... }
where separators would be defined as "not Unicode letters". Do you know:
1. How do I specify the right regex for "not a Unicode letter"?
2. What can I use to split like this? The regular String.split() loses the separators, just like the tokenizer I used above.
Thanks
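For reference, the { term, separator, term, separator, ... } split asked about above could be sketched with java.util.regex instead of String.split(). In Java regexes, \p{L} matches one Unicode letter and \P{L} (capital P) matches anything that is not one, so alternating between runs of the two keeps every character of the input. The class name, method names and sample dictionary below are made up for illustration, and treating digits and combining marks as separators is an assumption that follows from the "not Unicode letters" definition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TermSeparatorSplitter {
    // \p{L}+ is a run of Unicode letters (a term);
    // \P{L}+ is a run of anything else (a separator).
    private static final Pattern PIECE = Pattern.compile("\\p{L}+|\\P{L}+");

    /** Splits text into alternating term/separator pieces without dropping anything. */
    public static List<String> split(String text) {
        List<String> pieces = new ArrayList<String>();
        Matcher m = PIECE.matcher(text);
        while (m.find()) {
            pieces.add(m.group());
        }
        return pieces;
    }

    /** Replaces pieces found in the dictionary; every other piece is copied through unchanged. */
    public static String translate(String text, Map<String, String> dictionary) {
        StringBuilder out = new StringBuilder();
        for (String piece : split(text)) {
            String translated = dictionary.get(piece);
            out.append(translated != null ? translated : piece);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical dictionary; stopwords are simply absent from it.
        Map<String, String> dictionary = new HashMap<String, String>();
        dictionary.put("hello", "bonjour");
        dictionary.put("world", "monde");
        System.out.println(translate("hello: world the!\n  hello", dictionary));
    }
}
```

Because separators and unknown words (including stopwords) are never in the dictionary, they pass through untouched, so the output keeps the punctuation and line breaks of the input.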
- 01-13-2012, 08:40 PM #6
Re: how to preserve whitespaces etc when tokenizing stream?
Like I said, split it up into two steps. Store your big String in some variable. Also store a Set of all the words in that big String, without whitespace. Then do a replaceAll on the big String, using the words in the Set mapped to their translations. If you're talking about a String that's too big to fit in memory, then do it in chunks.
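A rough sketch of that two-step idea, under some assumptions: a "word" is taken to be a run of Unicode letters, the dictionary is a plain Map, and the class and method names are invented for illustration. Each word is quoted with Pattern.quote and guarded by lookarounds so that only whole words are replaced, and Matcher.quoteReplacement keeps any $ or \ in a translation from being read as a group reference.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoStepTranslator {
    /** Step 1: collect the distinct words (runs of Unicode letters) in the text. */
    static Set<String> collectWords(String text) {
        Set<String> words = new LinkedHashSet<String>();
        Matcher m = Pattern.compile("\\p{L}+").matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    /** Step 2: replaceAll on the full text for every word that has a translation. */
    static String translate(String text, Map<String, String> dictionary) {
        String result = text;
        for (String word : collectWords(text)) {
            String translation = dictionary.get(word);
            if (translation == null) {
                continue; // no entry (e.g. a stopword): leave the word as it is
            }
            // The lookarounds stop "cat" from matching inside "catalog".
            String regex = "(?<!\\p{L})" + Pattern.quote(word) + "(?!\\p{L})";
            result = result.replaceAll(regex, Matcher.quoteReplacement(translation));
        }
        return result;
    }
}
```

One caveat with this approach: the replacements are chained, so if the translation of one word happens to be another word that is processed later, it can be translated a second time. A single pass over the text avoids that.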