Thread: Lucene Indexer Encoding problem
- 10-13-2008, 06:57 PM #1
Member
- Join Date
- Oct 2008
- Location
- Toronto, Canada
- Posts
- 2
- Rep Power
- 0
Lucene Indexer Encoding problem
Good day guys,
I hope you can help me. I am trying to index French and Russian documents with Lucene and am having no luck. I am new to Java, so I really need your help.
I was able to extract the text from the PDFs. When I save it to a file everything is fine and I can clearly see the Russian characters in the txt file, but when I add the text to the index it is all ??? or other garbage.
Here is what I do:
I first use PDFBox to extract the text.
Java Code:
    String textFile = "c:/java/faq.txt";
    String pdfFile = "c:/java/faq.pdf";

    // First, extract the text from the PDF
    PDDocument document = PDDocument.load(pdfFile);
    Writer output = new OutputStreamWriter(new FileOutputStream(textFile), "UTF-8");
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setStartPage(1);
    stripper.setEndPage(20);

    // This saves the text to the txt file; the txt file is completely fine
    stripper.writeText(document, output);

    // But when I get the text like this to add it to the index...
    String textData = stripper.getText(document);
    Analyzer analyzer = new StandardAnalyzer();
    Directory directory = FSDirectory.getDirectory("c:/java/collection");
    IndexWriter iwriter = new IndexWriter(directory, analyzer, new IndexWriter.MaxFieldLength(250));
    Document doc = new Document();
    doc.add(new Field("fieldname", textData, Field.Store.YES, Field.Index.NOT_ANALYZED));
    iwriter.addDocument(doc);
    iwriter.optimize();
    iwriter.close();
The code above correctly saves the extracted text to the txt file, which I don't really need. What I want is to get the text and add it to the index right away. When I open the index files in Notepad I see garbage instead of Russian characters.
Please help. Thank you
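For anyone hitting the same symptom, it may help to first check whether the string itself is damaged or only the way it is being viewed. A Java String is Unicode internally, so ??? typically appears when text is encoded with a charset that has no Cyrillic letters. Lucene's index files are a binary format, so opening them in Notepad is not a reliable test either way. The sketch below uses only the JDK; the class name EncodingCheck and the helpers roundTrip and toCodePoints are made up for illustration and are not part of Lucene or PDFBox.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingCheck {
    // Encode the text to bytes with the given charset, then decode it back.
    // Characters the charset cannot represent come back as '?'.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // Render each code point as U+XXXX so the content can be inspected
    // without depending on the console or editor encoding.
    static String toCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The Cyrillic word "Privet", written with escapes to keep the source ASCII-only
        String sample = "\u041F\u0440\u0438\u0432\u0435\u0442";

        // Cyrillic letters sit in the U+0400..U+04FF range
        System.out.println(toCodePoints(sample));

        // UTF-8 can represent every character, so the round trip is lossless
        System.out.println(sample.equals(roundTrip(sample, StandardCharsets.UTF_8)));

        // ISO-8859-1 has no Cyrillic, so every letter is replaced with '?'
        System.out.println(roundTrip(sample, StandardCharsets.ISO_8859_1));
    }
}
```

If toCodePoints(textData) shows values in the Cyrillic range right after stripper.getText(document), the extracted text is intact, and the place to look is wherever it is later written out or displayed.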
- 02-09-2009, 06:06 PM #2
Member
- Join Date
- Feb 2009
- Location
- Italy
- Posts
- 51
- Rep Power
- 0
Russian analyzer
Why don't you try the RussianAnalyzer (org.apache.lucene.analysis.ru.RussianAnalyzer) instead of the StandardAnalyzer, which is geared toward English?
If you don't already have the RussianAnalyzer, you can find it here:
svn.apache.org/repos/asf/lucene/java/trunk/contrib/
I haven't tried it myself, but I hope it works for you.
- 02-09-2009, 08:19 PM #3
Member
- Join Date
- Oct 2008
- Location
- Toronto, Canada
- Posts
- 2
- Rep Power
- 0
Thank you for your reply. I will try that, but I was kind of hoping that Lucene could handle it on its own. The problem is that when I index a document I don't know what language it is written in: Russian, French, or English.
- 02-10-2009, 10:31 AM #4
Member
- Join Date
- Feb 2009
- Location
- Italy
- Posts
- 51
- Rep Power
- 0
The Lucene indexer uses an analyzer to recognize and extract the tokens in a text.
You can try different analyzers to see which works best for you. There are many, so I suggest experimenting with them.
If you don't know the exact language, try to find an analyzer that works well with all of them, but it will not be easy because of language-specific encoding.
You could also try to detect the language of the text first and then use the appropriate analyzer
(for example, by looking for language-specific characters).
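The "look for specific characters" idea can be sketched with the JDK alone by counting which Unicode script most letters belong to. This is only a rough illustration: the class ScriptGuesser and the method dominantScript are hypothetical names, and the approach can separate Russian (Cyrillic) from French or English (Latin) but cannot tell French from English, since both use the Latin script.

```java
import java.lang.Character.UnicodeScript;

class ScriptGuesser {
    // Returns "cyrillic", "latin", or "unknown" depending on which script
    // accounts for more of the letters in the text. Ties go to "latin".
    static String dominantScript(String text) {
        int cyrillic = 0;
        int latin = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, and whitespace
            }
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.CYRILLIC) {
                cyrillic++;
            } else if (script == UnicodeScript.LATIN) {
                latin++;
            }
        }
        if (cyrillic == 0 && latin == 0) {
            return "unknown";
        }
        return cyrillic > latin ? "cyrillic" : "latin";
    }

    public static void main(String[] args) {
        // Cyrillic sample written with escapes to keep the source ASCII-only
        System.out.println(dominantScript("\u041F\u0440\u0438\u0432\u0435\u0442 world"));
        System.out.println(dominantScript("Bonjour le monde"));
    }
}
```

The result could then drive the analyzer choice, for example a RussianAnalyzer when the answer is "cyrillic" and a StandardAnalyzer otherwise. Telling French from English would need something beyond script counting, such as common-word lists.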