  1. #1
    ilyaz is offline Member
    Join Date
    Jun 2011
    Posts
    10
    Rep Power
    0

    How to preserve whitespace etc. when tokenizing a stream?

    I am trying to perform a "translation" of sorts on a stream of text. More specifically, I need to tokenize the input stream, look up every term in a specialized dictionary, and output the corresponding "translation" of the token. However, I also want to preserve all the original whitespace, stopwords, punctuation etc. from the input, so that the output is formatted the same way as the input instead of ending up as a flat stream of translations. So if my input is

    <term1>: <term2> <stopword>! <term3>
    <term4>


    then I want the output to look like

    <term1'>: <term2'> <stopword>! <term3'>
    <term4'>


    (where <termi'> is translation of <termi>) instead of

    <term1'> <term2'> <term3'> <term4'>

    Currently I am doing the following:

    Java Code:
    PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
                                             PatternAnalyzer.WHITESPACE_PATTERN,
                                             false,
                                             WordlistLoader.getWordSet(new File(stopWordFilePath)));
    TokenStream ts = pa.tokenStream(null, in);
    CharTermAttribute charTermAttribute = ts.getAttribute(CharTermAttribute.class);

    while (ts.incrementToken()) { // loop over tokens
        String termIn = charTermAttribute.toString();
        ...
    }
    But this, of course, loses all the whitespace, punctuation and stopwords. How can I modify it so that I can re-insert them into the output? Thanks much!

  2. #2
    KevinWorkman is offline Crazy Cat Lady
    Join Date
    Oct 2010
    Location
    Washington, DC
    Posts
    3,691
    Rep Power
    8


    Can't you just tokenize on something other than whitespace?
    How to Ask Questions the Smart Way
    Static Void Games - Play indie games, learn from game tutorials and source code, upload your own games!

  3. #3
    ilyaz is offline Member

    But in that case, wouldn't the tokens include all the whitespace etc., which would break their lookup in the dictionary?

  4. #4
    KevinWorkman is offline Crazy Cat Lady

    Quote Originally Posted by ilyaz View Post
    But in that case, wouldn't the tokens include all the whitespace etc., which would break their lookup in the dictionary?
    Well, you could have two steps: split the words up and include the whitespace, then from there do the replace, maintaining the whitespace.

    You might be over-complicating things though. For your purposes, wouldn't a simple String.replace() work?

  5. #5
    ilyaz is offline Member

    Quote Originally Posted by KevinWorkman View Post
    You might be over-complicating things though. For your purposes, wouldn't a simple String.replace() work?
    I can't use replace(), since I don't know beforehand what I will be replacing with what. I need to separate out every "word" and then look it up. So this boils down to splitting the original string into a sequence of tokens like this

    { term, separator, term, separator, ... }

    where a separator is any run of characters that are not Unicode letters. Do you know:
    1. How do I write the right regex for "not a Unicode letter"?
    2. What can I use to split like this? The regular String.split() loses the separators, just like the tokenizer I used above.
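
    To make the two questions concrete, here is a minimal sketch using only java.util.regex. In Java regexes, \p{L} matches any Unicode letter and \P{L} (capital P) matches anything that is not one. Instead of String.split(), the sketch walks the input with a Matcher whose pattern accepts either kind of run, so nothing is dropped. The class and method names (SeparatorKeepingSplitter, splitKeepingSeparators, isTerm) are made up for illustration and are not part of Lucene or the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SeparatorKeepingSplitter {
    // A run of Unicode letters (a term) or a run of non-letters (a separator).
    private static final Pattern RUN = Pattern.compile("\\p{L}+|\\P{L}+");

    /** Returns the alternating letter and non-letter runs of the text, in input order. */
    static List<String> splitKeepingSeparators(String text) {
        List<String> runs = new ArrayList<>();
        Matcher m = RUN.matcher(text);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    /** True if the run is a term (starts with a letter) rather than a separator. */
    static boolean isTerm(String run) {
        return Character.isLetter(run.codePointAt(0));
    }

    public static void main(String[] args) {
        for (String run : splitKeepingSeparators("foo: bar the! baz\nqux")) {
            System.out.println((isTerm(run) ? "term      [" : "separator [") + run + "]");
        }
    }
}
```

    Concatenating the returned runs gives back the original text exactly, so only the runs for which isTerm() is true need a dictionary lookup. Note that under this definition digits count as separators.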

    Thanks

  6. #6
    KevinWorkman is offline Crazy Cat Lady

    Like I said, split it up into two steps. Store your big String in some variable. Also store a Set of all the words in that big String, without whitespace. Then do a replaceAll on the big String, using the words in the Set mapped to their translations. If you're talking about a String that's too big to fit in memory, then do it in chunks.
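
    A rough sketch of the replace step, under some assumptions: a "word" is taken to be a run of Unicode letters, the dictionary is a plain Map<String, String>, and words missing from the dictionary (stopwords, for example) are left as they are. Rather than calling replaceAll once per word, this variant makes a single pass with Matcher.appendReplacement, which avoids a translated word being translated a second time by a later replacement. InPlaceTranslator and translate are illustrative names only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InPlaceTranslator {
    // Assumption: a word is a maximal run of Unicode letters.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    /** Replaces each word found in the dictionary; all other text is copied unchanged. */
    static String translate(String text, Map<String, String> dictionary) {
        Matcher m = WORD.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = m.group();
            String replacement = dictionary.getOrDefault(word, word);
            // quoteReplacement keeps '$' and '\' in a translation from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last word
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("cat", "chat");
        dictionary.put("dog", "chien");
        System.out.println(translate("cat: dog the! cat\ndog", dictionary));
    }
}
```

    For input too large to hold in memory, the same method could be applied chunk by chunk, provided a chunk boundary never falls in the middle of a word.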
