  1. #1
    svirid is offline Member

    Lucene Indexer Encoding problem

    Good day guys,

    I hope you can help me. I am trying to index French and Russian documents with Lucene and am having no luck. I am new to Java, so I really need your help.

    I was able to get the text from the PDFs. When I save it, it's all fine: I can clearly see the Russian characters in the txt file. But when I add it to the index, it's all ??? or other garbage.

    Here is what I do:

    I first use PDFBox to extract the text.

    Java Code:
    String textFile = "c:/java/faq.txt";
    String pdfFile = "c:/java/faq.pdf";

    // First, extract the text from the PDF
    PDDocument document = PDDocument.load( pdfFile );

    Writer output = new OutputStreamWriter( new FileOutputStream( textFile ), "UTF-8" );
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setStartPage( 1 );
    stripper.setEndPage( 20 );

    // This saves the text into the txt file; the txt file is completely fine
    stripper.writeText(document, output);
    output.close();

    // But when I get the text like this to add it to the index
    String textData = stripper.getText(document);
    document.close();

    Analyzer analyzer = new StandardAnalyzer();
    Directory directory = FSDirectory.getDirectory("c:/java/collection");
    IndexWriter iwriter = new IndexWriter(directory, analyzer, new IndexWriter.MaxFieldLength(250));
    Document doc = new Document();

    // Field.Index.NOT_ANALYZED indexes the whole text as one term, without running the analyzer
    doc.add(new Field("fieldname", textData, Field.Store.YES, Field.Index.NOT_ANALYZED));
    iwriter.addDocument(doc);
    iwriter.optimize();
    iwriter.close();
    The code above correctly saves the extracted text to the txt file, which I don't really need. What I want is to get the text and add it to the index right away. When I open the index files in Notepad, I see garbage instead of Russian characters.

    Please help. Thank you
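
One common way Russian text turns into ??? in Java is a conversion through a charset that has no Cyrillic letters. The sketch below only illustrates that effect with the standard library; the class name `LossyEncodingDemo` and the method `roundTrip` are made up for this example, and the thread does not confirm that this is what happens in the code above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyEncodingDemo {
    // Encode the text with the given charset, then decode the bytes with the same charset.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The Russian word "Privet" (hello), written with Unicode escapes
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // UTF-8 can represent Cyrillic, so the text survives the round trip
        System.out.println(roundTrip(russian, StandardCharsets.UTF_8).equals(russian)); // true

        // ISO-8859-1 has no Cyrillic letters, so each one is replaced by '?'
        System.out.println(roundTrip(russian, StandardCharsets.ISO_8859_1)); // ??????
    }
}
```

If the text stays in a Java `String` all the way from the extractor to the `Field`, no such conversion takes place, which is why it helps to check where the garbage is actually being observed.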

  2. #2
    wolfcro is offline Member

    Russian analyzer

    Why don't you try the RussianAnalyzer (org.apache.lucene.analysis.ru.RussianAnalyzer) instead of the StandardAnalyzer, which is made for English?

    If you don't already have the RussianAnalyzer, you can find it here:
    svn.apache.org/repos/asf/lucene/java/trunk/contrib/

    I haven't tried it myself, but I hope it works for you.

  3. #3
    svirid is offline Member

    Thank you for your reply. I will try that, but I was kind of hoping that Lucene could handle it on its own. The problem is that when I am indexing a document, I don't know what language it is written in: Russian, French, or English.

  4. #4
    wolfcro is offline Member

    The Lucene indexer uses an analyzer to recognize and extract the tokens in a text.

    You can try different analyzers to see which works best for you. There are many, so I suggest trying them.

    If you don't know the exact language, try to find an analyzer that works well with all of them, but that will not be easy because of language-specific encoding.

    You could also first detect the language of the text in some way and then use the appropriate analyzer
    (maybe by looking for language-specific characters?).
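
The idea of looking for specific characters could be sketched roughly as follows, using only the standard library. The class `ScriptGuesser` and its method `dominantScript` are hypothetical names for this example. The sketch counts letters per Unicode script, so it can tell Cyrillic text from Latin text, but it cannot separate French from English, since both use the Latin script.

```java
import java.util.EnumMap;
import java.util.Map;

class ScriptGuesser {
    // Returns the Unicode script that most letters in the text belong to,
    // or COMMON when the text contains no letters at all.
    static Character.UnicodeScript dominantScript(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (Character.isLetter(codePoint)) {
                counts.merge(Character.UnicodeScript.of(codePoint), 1, Integer::sum);
            }
        }
        Character.UnicodeScript best = Character.UnicodeScript.COMMON;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeScript, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The Russian word "Privet" (hello), written with Unicode escapes
        System.out.println(dominantScript("\u041f\u0440\u0438\u0432\u0435\u0442")); // CYRILLIC
        System.out.println(dominantScript("Bonjour tout le monde"));                // LATIN
    }
}
```

A caller could then pick the RussianAnalyzer when the result is `CYRILLIC` and another analyzer otherwise. Telling French from English would need a further heuristic, such as checking for accented letters or common words, which is not shown here.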

  5. #5
    piyu.sha is offline Member

    If you try to read the index files directly, you may not be able to make sense of them. Instead, use Luke to browse the Lucene index. It can show how the data is stored, and you can also run some sample searches on the same index.
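
To illustrate why raw index files can look like garbage in a text editor even when the data is intact, here is a small sketch. It assumes the text is stored as UTF-8 bytes and that the editor decodes those bytes with a single-byte charset (ISO-8859-1 is used here as a stand-in for whatever the editor picks). The class `WrongDecoderDemo` and its methods are hypothetical names for this example.

```java
import java.nio.charset.StandardCharsets;

class WrongDecoderDemo {
    // Encode the text as UTF-8, then decode the bytes with the wrong charset,
    // the way an editor might when it opens a file without knowing its encoding.
    static String viewUtf8AsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Undo the wrong decoding: the original bytes are still there, so the text comes back.
    static String recover(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The Russian word "Privet" (hello), written with Unicode escapes
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";
        String garbled = viewUtf8AsLatin1(russian);

        System.out.println(garbled.length());                 // 12: two bytes per Cyrillic letter
        System.out.println(recover(garbled).equals(russian)); // true
    }
}
```

The garbled string is twice as long as the original and unreadable, yet decoding the same bytes as UTF-8 recovers the text. A tool that knows the index format, such as Luke, avoids this guesswork.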



  6. #6
    wolfcro is offline Member

    We weren't talking about reading the index but about creating it...
    but there is an appropriate API for everything, hehe :P
