12-06-2011, 11:40 PM #1
tokenizing text using language analyzer but preserving stopwords if possible
I need to implement a "quick and dirty" (or "poor man's") translation of a foreign-language document by looking up each word in a dictionary and replacing it with its English translation. So I need to tokenize the original foreign text into words, then look up each word in turn and get its translation. If possible, I also need to preserve "non-words", i.e. stopwords, so that I can copy them to the output stream untranslated. If that is not possible, I at least need to preserve the order of the original words so that their translations appear in the same order in the output.
Can I accomplish this using Lucene components? I presume I'd have to start by creating an analyzer for the foreign language, but then what? How do I (i) tokenize, (ii) access the words in their original order, and (iii) also access the non-words, if possible?
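To make the requirement concrete, here is a minimal sketch of the word-by-word replacement using only the JDK's java.text.BreakIterator rather than a Lucene analyzer (a swapped-in technique, not a Lucene answer). The class name PoorMansTranslator, the translate method, and the tiny dictionary are hypothetical names made up for illustration. The sketch assumes the dictionary keys are lower-case surface forms and that any segment not found in the dictionary (punctuation, whitespace, stopwords, unknown words) should be copied through unchanged, which keeps the original order intact.

```java
import java.text.BreakIterator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene or any library.
final class PoorMansTranslator {

    // Walks the text segment by segment in its original order. Segments that
    // start with a letter or digit are looked up in the dictionary; all other
    // segments (spaces, punctuation) and unknown words are copied verbatim.
    static String translate(String text, Map<String, String> dictionary, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        StringBuilder out = new StringBuilder();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                // Assumption: dictionary keys are stored in lower case.
                String translation = dictionary.get(segment.toLowerCase(locale));
                out.append(translation != null ? translation : segment);
            } else {
                out.append(segment);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("chat", "cat");
        dictionary.put("noir", "black");
        System.out.println(translate("Le chat noir, et le chien.", dictionary, Locale.FRENCH));
    }
}
```

With a Lucene analyzer, the analogous idea would presumably be to read each token's start and end offsets (OffsetAttribute) from the TokenStream and copy the text between consecutive tokens straight from the original string, so that anything the analyzer drops, stopwords included, still reaches the output. That part is not shown here and would need checking against the Lucene version in use.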