- 08-16-2012, 03:39 AM #1
Using Lucene to get term-document matrix
I'm trying to do a bit of text analysis with Lucene. At the moment, I can successfully output the term-document matrix for the indexed corpus. However, I'm having trouble implementing one specific feature.
The program returns the term-document matrix, where the terms are single words (stemmed with the Porter stemmer). I need to add the ability to include important, user-specified bigrams and trigrams in the term-document matrix. In short, we'd like the output to be a term-document matrix whose terms are single words (ideally stemmed) plus user-specified phrases of more than one word.
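To make the desired output concrete, here is a minimal sketch in plain Python of what I'm after. It does not call Lucene or PyLucene at all. The names `tokenize` and `build_tdm`, the `phrases` argument, and the `stem` hook are made up for illustration. It assumes whitespace tokenization, and it assumes that words consumed by a matched phrase are not also counted as single words (that choice could easily go the other way).

```python
from collections import Counter


def tokenize(text, phrases, stem=lambda w: w):
    """Split text into terms, keeping user-specified phrases as single terms.

    `stem` is a hypothetical hook; pass a real Porter stemmer here if you
    have one. By default words are left unchanged (only lowercased).
    """
    words = text.lower().split()
    # Store phrases as word tuples so they can be compared to word windows.
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)

    terms = []
    i = 0
    while i < len(words):
        # Try the longest possible phrase first, down to bigrams.
        for n in range(max_len, 1, -1):
            chunk = tuple(words[i:i + n])
            if len(chunk) == n and chunk in phrase_set:
                terms.append(" ".join(chunk))  # phrases are not stemmed
                i += n
                break
        else:
            # No phrase starts at this position: emit a stemmed single word.
            terms.append(stem(words[i]))
            i += 1
    return terms


def build_tdm(docs, phrases, stem=lambda w: w):
    """Return (vocab, matrix) where matrix[t][d] is the count of term t in doc d."""
    counts = [Counter(tokenize(d, phrases, stem)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in vocab]
    return vocab, matrix
```

Calling `build_tdm(["new york is big", "york is old new"], ["new york"])` would give one row per term, with "new york" appearing as its own row next to the single-word rows. What I can't work out is how to get the equivalent behaviour out of the Lucene analysis chain.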
I've spent a lot of time trying to figure this out. I'm using PyLucene and unfortunately don't know a lot of Java. As such, the underlying Java code is a tedious read that I've avoided as much as possible.
Is there an easy way to do this? Should I include some code samples?
Thanks in advance.