- 01-04-2011, 02:24 AM #1
creating autocomplete index for multiple words (phrases)
I was hoping someone could point me in the right direction to accomplish what I am after:
I want to create an autocomplete index, built from an existing index, that accepts an incomplete search string like "elec" and returns suggested PHRASES such as "electric guitar" and "electricity and magnetism", ranked by how frequently each phrase occurs in the source index.
Similarly, "electric gui" would return "electric guitar", and so on.
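To make the goal concrete, here is a minimal sketch of the desired behaviour in plain Java, without Lucene. The class name `PhraseSuggestSketch`, the method `suggest`, and the phrase counts are all made up for illustration; a real index would replace the in-memory map and the linear scan.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: none of these names come from Lucene.
final class PhraseSuggestSketch {

    // Returns up to `limit` phrases that start with `prefix`, most frequent first.
    static List<String> suggest(Map<String, Integer> phraseCounts, String prefix, int limit) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<Map.Entry<String, Integer>> hits = new ArrayList<>();
        for (Map.Entry<String, Integer> e : phraseCounts.entrySet()) {
            if (e.getKey().toLowerCase(Locale.ROOT).startsWith(p)) {
                hits.add(e);
            }
        }
        // Higher count first; ties are broken alphabetically so the order is stable.
        hits.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> out = new ArrayList<>();
        for (int i = 0; i < hits.size() && i < limit; i++) {
            out.add(hits.get(i).getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up counts, standing in for phrase frequencies in the source index.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("electric guitar", 12);
        counts.put("electricity and magnetism", 7);
        counts.put("electron microscope", 3);
        counts.put("acoustic guitar", 9);

        System.out.println(suggest(counts, "elec", 5));
        System.out.println(suggest(counts, "electric gui", 5));
    }
}
```

The linear scan is only there to pin down the expected results; the edge n-gram index in the code further down exists precisely to avoid scanning every phrase at query time.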
I have done some searching and found an implementation of this, which I have used, but the index it creates uses the StandardTokenizer and splits everything into single words, which isn't good enough for me.
I have modified the analyzer to use the KeywordTokenizer, which should keep the whitespace characters and the words up to the stop words. But because the index is built by iterating with LuceneDictionary.getWordsIterator(), I have no way to iterate through PHRASES without duplicates.
So any index I create will either return only single words, or return phrases but with duplicates.
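The duplicate problem can be pictured independently of Lucene. The sketch below assumes the raw phrase values are available as a plain list of strings (for example read from a stored field, which is an assumption and not something the code below does), and collapses them into one entry per phrase with a count. `UniquePhraseCounterSketch` and `countPhrases` are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: collapses repeated phrases into phrase -> count.
final class UniquePhraseCounterSketch {

    static Map<String, Integer> countPhrases(List<String> fieldValues) {
        // LinkedHashMap keeps first-seen order, which makes the result easy to inspect.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String raw : fieldValues) {
            // Normalise case and whitespace so trivially different spellings merge.
            String phrase = raw.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
            if (phrase.isEmpty()) {
                continue;
            }
            counts.merge(phrase, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "Electric Guitar", "electric   guitar", "electricity and magnetism");
        System.out.println(countPhrases(values));
    }
}
```

Each key of the resulting map could then become one document in the autocomplete index, with the count stored for sorting, in the same way the single-word version below uses its wordsMap.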
I think there may also be a way to implement this using shingles instead, but I am not sure which way to go. Please advise!
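For reference, a shingle is a word-level n-gram: a run of adjacent words treated as a single token. The sketch below shows the idea in plain Java, assuming simple whitespace splitting; it is not Lucene's ShingleFilter, and the names `ShingleSketch` and `shingles` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: word n-grams built by whitespace splitting.
final class ShingleSketch {

    // Returns every run of minSize..maxSize adjacent words, in order of start position.
    static List<String> shingles(String text, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            for (int size = minSize; size <= maxSize && start + size <= words.length; size++) {
                out.add(String.join(" ", Arrays.copyOfRange(words, start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [electricity and, electricity and magnetism, and magnetism]
        System.out.println(shingles("Electricity and magnetism", 2, 3));
    }
}
```

With shingles, phrases are derived from running text, so they suit a source field that holds sentences; the keyword approach suits a field whose whole value is already the phrase.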
My current code is below:
Java Code:
public final class Autocompleter {

    public Autocompleter(String autoCompleteDir) throws IOException {
        this.autoCompleteDirectory = FSDirectory.open(new File(autoCompleteDir), new NoLockFactory());
        //reOpenReader();
    }

    public List<String> suggestTermsFor(String term) throws IOException {
        // get the top 5 terms for query
        Query query = new TermQuery(new Term(GRAMMED_WORDS_FIELD, term));
        Sort sort = new Sort(new SortField(COUNT_FIELD, SortField.INT, true));
        TopDocs docs = autoCompleteSearcher.search(query, null, 5, sort);
        List<String> suggestions = new ArrayList<String>();
        for (ScoreDoc doc : docs.scoreDocs) {
            suggestions.add(autoCompleteReader.document(doc.doc).get(SOURCE_WORD_FIELD));
        }
        return suggestions;
    }

    // INDEXING METHOD WHICH I FOUND ONLINE
    @SuppressWarnings("unchecked")
    public void reIndex(Directory sourceDirectory, String fieldToAutocomplete)
            throws CorruptIndexException, IOException {
        // build a dictionary (from the spell package)
        IndexReader sourceReader = IndexReader.open(sourceDirectory);
        LuceneDictionary dict = new LuceneDictionary(sourceReader, fieldToAutocomplete);

        // use a custom analyzer so we can do EdgeNGramFiltering
        IndexWriter writer = new IndexWriter(autoCompleteDirectory, new Analyzer() {
            public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream result = new StandardTokenizer(Version.LUCENE_30, reader);
                result = new StandardFilter(result);
                result = new LowerCaseFilter(result);
                result = new StopFilter(false, result, StopFilter.makeStopSet(ENGLISH_STOP_WORDS));
                result = new EdgeNGramTokenFilter(result, Side.FRONT, 1, 20);
                return result;
            }
        }, true, IndexWriter.MaxFieldLength.LIMITED);
        writer.setMergeFactor(300);
        writer.setMaxBufferedDocs(150);

        // go through every word, storing the original word (incl. n-grams)
        // and the number of times it occurs
        Map<String, Integer> wordsMap = new HashMap<String, Integer>();
        Iterator<String> iter = (Iterator<String>) dict.getWordsIterator();
        while (iter.hasNext()) {
            String word = iter.next();
            int len = word.length();
            if (len < 3) {
                continue; // too short we bail but "too long" is fine...
            }
            if (wordsMap.containsKey(word)) {
                throw new IllegalStateException("This should never happen in Lucene 2.3.2");
                // wordsMap.put(word, wordsMap.get(word) + 1);
            } else {
                // use the number of documents this word appears in
                wordsMap.put(word, sourceReader.docFreq(new Term(fieldToAutocomplete, word)));
            }
        }

        for (String word : wordsMap.keySet()) {
            // ok index the word
            Document doc = new Document();
            doc.add(new Field(SOURCE_WORD_FIELD, word, Field.Store.YES, Field.Index.ANALYZED)); // orig term
            doc.add(new Field(GRAMMED_WORDS_FIELD, word, Field.Store.YES, Field.Index.ANALYZED)); // grammed
            doc.add(new Field(COUNT_FIELD, Integer.toString(wordsMap.get(word)), Field.Store.NO, Field.Index.NOT_ANALYZED)); // count
            writer.addDocument(doc);
        }

        sourceReader.close();

        // close writer
        writer.optimize();
        writer.close();

        // re-open our reader
        reOpenReader();
    }

    // MY START/ATTEMPT AT MULTI-WORD INDEX
    @SuppressWarnings("unchecked")
    public void reIndex2(Directory sourceDirectory, String fieldToAutocomplete)
            throws CorruptIndexException, IOException {
        // build a dictionary (from the spell package)
        IndexReader sourceReader = IndexReader.open(sourceDirectory);
        LuceneDictionary dict = new LuceneDictionary(sourceReader, fieldToAutocomplete);

        // use a custom analyzer so we can do EdgeNGramFiltering
        IndexWriter writer = new IndexWriter(autoCompleteDirectory, new Analyzer() {
            public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream result = new KeywordTokenizer(reader);
                //result = new StandardFilter(result);
                result = new LowerCaseFilter(result);
                result = new StopFilter(false, result, StopFilter.makeStopSet(ENGLISH_STOP_WORDS));
                result = new EdgeNGramTokenFilter(result, Side.FRONT, 1, 30);
                return result;
            }
        }, true, IndexWriter.MaxFieldLength.LIMITED);

        // NEED TO KNOW WHAT TO DO HERE????
    }
}
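As a rough picture of what the analyzer in reIndex2 is meant to emit for one phrase, the plain-Java sketch below lowercases the whole phrase as a single token and then takes its front edge n-grams. It only mimics the intended effect of the KeywordTokenizer, LowerCaseFilter and EdgeNGramTokenFilter chain and is not Lucene code; `EdgeNGramSketch` and `edgeNGrams` are hypothetical names, and `minGram` is assumed to be at least 1. A stop filter placed after a keyword tokenizer sees only that one whole-phrase token, so as far as I can tell it would drop the token only when the entire phrase is itself a stop word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: front edge n-grams of a whole phrase.
final class EdgeNGramSketch {

    // Returns the prefixes of the lowercased phrase with lengths minGram..maxGram.
    static List<String> edgeNGrams(String phrase, int minGram, int maxGram) {
        String token = phrase.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, token.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> grams = edgeNGrams("Electric Guitar", 1, 30);
        // The partial query "electric gui" is one of the indexed prefixes.
        System.out.println(grams.contains("electric gui"));
        System.out.println(grams);
    }
}
```

Because every prefix of the phrase becomes an indexed term, an exact TermQuery on the lowercased partial input, as suggestTermsFor already issues, would match the phrase document directly.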