- 12-02-2011, 04:56 PM #1
Design questions: searching for multiple terms in a document collection
I am trying to make some high-level (and not so high-level) design decisions for my app, which is supposed to check a collection of documents against a set of terms/queries. Basically, I need to perform a triage of sorts: find only those docs in the collection that contain at least one occurrence of at least one term from the term list. For those docs, I also need to find where in the document each occurrence is, since I then need to collect a small amount of surrounding text for a more detailed analysis.
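To make the requirement concrete, here is a minimal plain-Java sketch of the output I am after. It does not use Lucene at all and only does a naive case-insensitive substring scan; `TermTriage`, `Hit`, `findHits`, `triage` and the `window` parameter are names made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a brute-force scan, not Lucene and not an index.
class TermTriage {

    /** One occurrence of a term, plus a little surrounding text. */
    static final class Hit {
        final String term;    // the term from the term list that matched
        final int start;      // character offset of the match in the document
        final String context; // the match with up to `window` chars on each side

        Hit(String term, int start, String context) {
            this.term = term;
            this.start = start;
            this.context = context;
        }
    }

    /** Finds every case-insensitive, non-overlapping occurrence of each term in the text. */
    static List<Hit> findHits(String text, List<String> terms, int window) {
        List<Hit> hits = new ArrayList<>();
        for (String term : terms) {
            if (term.isEmpty()) {
                continue; // an empty term would match everywhere
            }
            int i = 0;
            while (i + term.length() <= text.length()) {
                if (text.regionMatches(true, i, term, 0, term.length())) {
                    int from = Math.max(0, i - window);
                    int to = Math.min(text.length(), i + term.length() + window);
                    hits.add(new Hit(term, i, text.substring(from, to)));
                    i += term.length();
                } else {
                    i++;
                }
            }
        }
        return hits;
    }

    /** Keeps only the documents that contain at least one term (the "triage"). */
    static Map<String, List<Hit>> triage(Map<String, String> docs,
                                         List<String> terms, int window) {
        Map<String, List<Hit>> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            List<Hit> hits = findHits(doc.getValue(), terms, window);
            if (!hits.isEmpty()) {
                result.put(doc.getKey(), hits);
            }
        }
        return result;
    }
}
```

Calling `TermTriage.triage(docs, terms, 40)` on a map of document id to text would return only the ids with at least one hit, each with its offsets and context snippets. This is the behaviour I would like to reproduce efficiently with an index.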
Clearly, I will need to index the document collection using Lucene's indexing classes. This is pretty straightforward.
Then I will need to use the highlighting classes. In some sample code I found online, a query is first searched for and hits are returned. Then the doc ids are extracted from the hits and the query is highlighted in each of those docs. Some questions:
Q1: Does Lucene perform essentially the same searching operation twice, first to find hits, then to highlight? If so, does this mean that if I expect most of the docs in my collection to contain at least one of the search terms, it might be faster for me to skip searching and simply go over all docs, applying highlighting? Then for those docs where no hits occurred I would simply get an empty list of relevant fragments.
Q2: Is the same scoring mechanism used during search and during highlighting? That is, can I be sure that if I get a hit during search, the corresponding document indeed contains my query, and that highlighting will then find it?
Q3: Are there any mechanisms in Lucene that would facilitate merging of highlighting results for two different queries against a single document?
Q4: I did some small tests of highlighting and noticed that some of the fragments returned for a query contained highlighted text that was quite far from the original query. For instance, I was looking for a 3-word term and it highlighted a sequence of only 2 of these 3 words. How can I control how close highlighted fragments should be to the original query?
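For Q3, the kind of merging I have in mind can be sketched in plain Java as below. This is not a Lucene API; `SpanMerger`, `Span` and `merge` are hypothetical names, and the sketch assumes each query's highlights are available as character-offset ranges within the same document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: merges highlight ranges from two queries on one document.
class SpanMerger {

    /** A highlighted character range: start inclusive, end exclusive. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Combines both lists and fuses ranges that overlap or touch. */
    static List<Span> merge(List<Span> first, List<Span> second) {
        List<Span> all = new ArrayList<>(first);
        all.addAll(second);
        all.sort(Comparator.comparingInt((Span s) -> s.start));

        List<Span> merged = new ArrayList<>();
        for (Span s : all) {
            int lastIndex = merged.size() - 1;
            if (lastIndex >= 0 && s.start <= merged.get(lastIndex).end) {
                // Overlaps (or is adjacent to) the previous range: extend it.
                Span last = merged.remove(lastIndex);
                merged.add(new Span(last.start, Math.max(last.end, s.end)));
            } else {
                merged.add(s);
            }
        }
        return merged;
    }
}
```

With ranges merged this way, the surrounding text for overlapping hits from different queries would be collected only once per region of the document.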
If you can shed some light on any of these questions, I'll be very grateful!