Using Lucene to get term-document matrix
I'm trying to do a bit of text analysis with Lucene. At the moment, I can successfully output the term-document matrix for the indexed corpus. However, I'm having trouble implementing one specific feature.
The program returns the term-document matrix, where the terms are single words (stemmed with the Porter stemmer). I need to add the ability to include important, user-specified bigrams and trigrams in the term-document matrix. In short, we'd like the output to be a term-document matrix whose terms are single words (ideally stemmed) plus user-specified phrases of more than one word.
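To make the desired output concrete, here is a minimal plain-Python sketch of what I'm after. It does not use Lucene at all, and everything in it is an assumption for illustration: `stem` is a hypothetical placeholder (identity) standing in for the Porter stemmer, tokenization is a naive whitespace split, and `doc_terms` and `term_document_matrix` are names I made up.

```python
from collections import Counter


def stem(token):
    # Hypothetical placeholder: returns the token unchanged.
    # A real Porter stemmer would be swapped in here.
    return token


def doc_terms(text, phrases):
    """Count stemmed single words plus user-specified phrases in one document."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    counts = Counter(stem(t) for t in tokens)
    for phrase in phrases:
        words = phrase.lower().split()
        n = len(words)
        # Slide a window of n tokens over the document and count exact matches.
        hits = sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words
        )
        if hits:
            counts[phrase.lower()] += hits
    return counts


def term_document_matrix(docs, phrases):
    """Return (vocab, matrix), where matrix[i][j] is the count of vocab[i] in docs[j]."""
    per_doc = [doc_terms(d, phrases) for d in docs]
    vocab = sorted(set().union(*per_doc))
    matrix = [[counts[term] for counts in per_doc] for term in vocab]
    return vocab, matrix


docs = ["new york is big", "york is old"]
vocab, matrix = term_document_matrix(docs, ["new york"])
for term, row in zip(vocab, matrix):
    print(term, row)
```

Note that in this sketch the single words that make up a phrase ("new", "york") are still counted on their own in addition to the phrase row. What I can't figure out is how to get the equivalent out of the Lucene index.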
I've spent a lot of time trying to figure this out. I'm using PyLucene and unfortunately don't know a lot of Java. As such, the underlying Java code is a tedious read that I've avoided as much as possible.
Is there an easy way to do this? Should I post some of my current code?
Thanks in advance.