  1. #1
    lrichardson

    creating autocomplete index for multiple words (phrases)

    I was hoping someone could point me in the right direction to accomplish what I am after:

    I want to create an index, based on an existing index, that will accept an incomplete search string like "elec" and return suggested PHRASES such as "electric guitar" and "electricity and magnetism", ranked by the frequency of those phrases in the index.

    Similarly, "electric gui" would return "electric guitar", and so on (I think you get the idea).
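Stripped of Lucene, the behavior I want boils down to a prefix-to-phrase lookup ranked by frequency. A minimal plain-Java sketch of that (the phrases and counts here are invented purely for illustration):

```java
import java.util.*;

class PrefixSuggest {
    // phrase -> how often it occurs in the index (invented numbers)
    private final Map<String, Integer> phraseFreq = new HashMap<>();

    public void add(String phrase, int freq) {
        phraseFreq.put(phrase.toLowerCase(Locale.ROOT), freq);
    }

    // return up to maxHits phrases starting with the prefix, most frequent first
    public List<String> suggest(String prefix, int maxHits) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<Map.Entry<String, Integer>> hits = new ArrayList<>();
        for (Map.Entry<String, Integer> e : phraseFreq.entrySet()) {
            if (e.getKey().startsWith(p)) {
                hits.add(e);
            }
        }
        hits.sort((a, b) -> b.getValue() - a.getValue());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(maxHits, hits.size()); i++) {
            out.add(hits.get(i).getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        PrefixSuggest s = new PrefixSuggest();
        s.add("electric guitar", 42);
        s.add("electricity and magnetism", 17);
        s.add("electron microscope", 5);
        System.out.println(s.suggest("elec", 5));
        // prints [electric guitar, electricity and magnetism, electron microscope]
        System.out.println(s.suggest("electric gui", 5));
        // prints [electric guitar]
    }
}
```

The Lucene version below does the same thing, but stores the edge n-grams of each phrase in the index so the prefix match becomes a simple TermQuery.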

    I have done some searching and found an implementation of this, which I have used, but the index it creates uses the StandardTokenizer and splits everything into single words, which isn't good enough for me.


    I have modified the analyzer to use the KeywordTokenizer, which should preserve whitespace (so whole phrases, minus stop words, come through as one token)... but the way the index is created (using LuceneDictionary's getWordsIterator() method), I have no way to iterate through PHRASES in a unique fashion...

    Thus any index I create will either be one that returns only single words, or one that returns phrases, but with duplicate phrases...

    I think there may also be a way to implement this using shingles instead, but I am not sure which way to go... please advise!
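For reference, the shingle idea is word-level n-grams: Lucene ships a ShingleFilter (in org.apache.lucene.analysis.shingle) that emits them from a token stream. The core transformation can be sketched in plain Java, without Lucene, just to show what a shingle is:

```java
import java.util.*;

class Shingles {
    // produce all word n-grams of length minSize..maxSize, in document order
    public static List<String> shingles(String text, int minSize, int maxSize) {
        String[] words = text.toLowerCase(Locale.ROOT).split("\\s+");
        List<String> out = new ArrayList<>();
        for (int n = minSize; n <= maxSize; n++) {
            for (int i = 0; i + n <= words.length; i++) {
                out.add(String.join(" ", Arrays.asList(words).subList(i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 2-word shingles of a title
        System.out.println(shingles("electricity and magnetism", 2, 2));
        // prints [electricity and, and magnetism]
    }
}
```

Indexing shingles instead of single words would give the suggester multi-word units to complete against, at the cost of a larger index.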

    My current code is below:



    Java Code:
    public final class Autocompleter {
        // NOTE: these declarations were omitted from the excerpt; names are assumed
        private static final String GRAMMED_WORDS_FIELD = "grammedWords";
        private static final String SOURCE_WORD_FIELD = "sourceWord";
        private static final String COUNT_FIELD = "count";
        private static final String[] ENGLISH_STOP_WORDS = { /* standard English stop words */ };

        private final Directory autoCompleteDirectory;
        private IndexReader autoCompleteReader;
        private IndexSearcher autoCompleteSearcher;

        public Autocompleter(String autoCompleteDir) throws IOException {
            this.autoCompleteDirectory = FSDirectory.open(new File(autoCompleteDir), new NoLockFactory());
            //reOpenReader();
        }
    
        public List<String> suggestTermsFor(String term) throws IOException {
        	// get the top 5 terms for query
        	Query query = new TermQuery(new Term(GRAMMED_WORDS_FIELD, term));
        	Sort sort = new Sort(new SortField(COUNT_FIELD, SortField.INT, true));
    
        	TopDocs docs = autoCompleteSearcher.search(query, null, 5, sort);
    
        	List<String> suggestions = new ArrayList<String>();
        	for (ScoreDoc doc : docs.scoreDocs) {
        		suggestions.add(autoCompleteReader.document(doc.doc).get(
        				SOURCE_WORD_FIELD));
        	}
    
        	return suggestions;
        }
    
    	// INDEXING METHOD WHICH I FOUND ONLINE
        @SuppressWarnings("unchecked")
        public void reIndex(Directory sourceDirectory, String fieldToAutocomplete)
        		throws CorruptIndexException, IOException {
    		
        	// build a dictionary (from the spell package)
        	IndexReader sourceReader = IndexReader.open(sourceDirectory);
    
        	LuceneDictionary dict = new LuceneDictionary(sourceReader,
        			fieldToAutocomplete);
    
        	// use a custom analyzer so we can do EdgeNGramFiltering
        	IndexWriter writer = new IndexWriter(autoCompleteDirectory,
        	new Analyzer() {
        		public TokenStream tokenStream(String fieldName,
        				Reader reader) {
        			TokenStream result = new StandardTokenizer(Version.LUCENE_30, reader);
    
        			result = new StandardFilter(result);
        			result = new LowerCaseFilter(result);
        			result = new StopFilter(false, result,
        				StopFilter.makeStopSet(ENGLISH_STOP_WORDS));
        			result = new EdgeNGramTokenFilter(result, Side.FRONT, 1, 20);
    
        			return result;
        		}
    	}, true, IndexWriter.MaxFieldLength.LIMITED);
    
        	writer.setMergeFactor(300);
        	writer.setMaxBufferedDocs(150);
    
        	// go through every word, storing the original word (incl. n-grams)
        	// and the number of times it occurs
        	Map<String, Integer> wordsMap = new HashMap<String, Integer>();
    
        	Iterator<String> iter = (Iterator<String>) dict.getWordsIterator();
        	while (iter.hasNext()) {
        		String word = iter.next();
    
        		int len = word.length();
        		if (len < 3) {
        			continue; // too short we bail but "too long" is fine...
        		}
    
        		if (wordsMap.containsKey(word)) {
        			throw new IllegalStateException(
        					"This should never happen in Lucene 2.3.2");
        			// wordsMap.put(word, wordsMap.get(word) + 1);
        		} else {
        			// use the number of documents this word appears in
        			wordsMap.put(word, sourceReader.docFreq(new Term(
        					fieldToAutocomplete, word)));
        		}
        	}
    
        	for (String word : wordsMap.keySet()) {
        		// ok index the word
        		Document doc = new Document();
        		doc.add(new Field(SOURCE_WORD_FIELD, word, Field.Store.YES,
        				Field.Index.ANALYZED)); // orig term
        		doc.add(new Field(GRAMMED_WORDS_FIELD, word, Field.Store.YES,
        				Field.Index.ANALYZED)); // grammed
        		doc.add(new Field(COUNT_FIELD,
        				Integer.toString(wordsMap.get(word)), Field.Store.NO,
        				Field.Index.NOT_ANALYZED)); // count
    
        		writer.addDocument(doc);
        	}
    
        	sourceReader.close();
    
        	// close writer
        	writer.optimize();
        	writer.close();
    
        	// re-open our reader
        	reOpenReader();
        }
    	
    	// MY START/ATTEMPT AT MULTI-WORD INDEX
        @SuppressWarnings("unchecked")
        public void reIndex2(Directory sourceDirectory, String fieldToAutocomplete)
        		throws CorruptIndexException, IOException {
        	// build a dictionary (from the spell package)
        	IndexReader sourceReader = IndexReader.open(sourceDirectory);
    
        	LuceneDictionary dict = new LuceneDictionary(sourceReader,
        			fieldToAutocomplete);
    				
        	// use a custom analyzer so we can do EdgeNGramFiltering
        	IndexWriter writer = new IndexWriter(autoCompleteDirectory,
        	new Analyzer() {
        		public TokenStream tokenStream(String fieldName,
        				Reader reader) {
        			TokenStream result = new KeywordTokenizer(reader);
    
        			//result = new StandardFilter(result);
        			result = new LowerCaseFilter(result);
        			result = new StopFilter(false, result,
        				StopFilter.makeStopSet(ENGLISH_STOP_WORDS));
        			result = new EdgeNGramTokenFilter(result, Side.FRONT, 1, 30);
    
        			return result;
        		}
    	}, true, IndexWriter.MaxFieldLength.LIMITED);
    
        	// NEED TO KNOW WHAT TO DO HERE????
        }
    }
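    The missing piece of reIndex2 reduces to: walk every phrase occurrence and collapse duplicates into a unique phrase -> count map (the same wordsMap shape that reIndex() builds), then index each unique phrase once. Here is a plain-Java sketch of that aggregation, independent of how the phrases are pulled out of the source index; in Lucene terms, indexing the source field through something like ShingleAnalyzerWrapper and then walking the term enumeration for that field (each term appears once, with docFreq available) seems like one way to feed it:

```java
import java.util.*;

class PhraseCounter {
    // collapse duplicate phrase occurrences into unique phrase -> occurrence count
    public static Map<String, Integer> countPhrases(Iterable<String> phraseOccurrences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String phrase : phraseOccurrences) {
            String key = phrase.toLowerCase(Locale.ROOT).trim();
            if (key.length() < 3) {
                continue; // same "too short" cutoff as reIndex()
            }
            Integer old = counts.get(key);
            counts.put(key, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> occurrences = Arrays.asList(
            "Electric Guitar", "electric guitar", "electricity and magnetism");
        System.out.println(countPhrases(occurrences));
        // e.g. {electric guitar=2, electricity and magnetism=1} (map order not guaranteed)
    }
}
```

    The resulting map could then be fed into the same document-per-entry indexing loop that reIndex() already uses, so duplicates never reach the autocomplete index.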

  2. #2
    tianxiad

    Add to this article ...

  3. #3
    lrichardson

    Huh? What article?

    Does anyone have any ideas? How do people normally build autocomplete functions? This has to be a common need...
