  1. #1
    gmuresan (Member, joined Jun 2011)

    Score combination - Filtering vs. Querying

    The issue I have is well illustrated by section 3.4.5, "Combining queries: BooleanQuery", in Lucene in Action (LIA), 2nd ed. The example uses BooleanQuery to combine:
    - a TermQuery, for matching document topic, for which the TF-IDF scoring makes sense; and
    - a NumericRangeQuery, whose purpose is to filter by publication date.

    I extended the example code to output the query and the explanation:

    Title AND Date = +subject:search +pubmonth:[201001 TO 201012]
    ----------
    Lucene in Action, Second Edition
    1.6848878 = (MATCH) sum of:
      1.3560408 = (MATCH) weight(subject:search in 9), product of:
        0.9443832 = queryWeight(subject:search), product of:
          2.871802 = idf(docFreq=1, maxDocs=13)
          0.3288469 = queryNorm
        1.435901 = (MATCH) fieldWeight(subject:search in 9), product of:
          1.0 = tf(termFreq(subject:search)=1)
          2.871802 = idf(docFreq=1, maxDocs=13)
          0.5 = fieldNorm(field=subject, doc=9)
      0.3288469 = (MATCH) ConstantScoreQuery(pubmonth:[201001 TO 201012]), product of:
        1.0 = boost
        0.3288469 = queryNorm

    Computing a queryNorm for the NumericRangeQuery is meaningless. Instead of simply filtering by date, this clause contributes a substantial amount to the overall score (0.3288 out of 1.6849, roughly 20%), and proportionally more when the text match has a low score.
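    The numbers in the explanation can be reproduced by hand. The following is a minimal sketch, assuming the Lucene 3.x DefaultSimilarity formulas idf = 1 + ln(maxDocs / (docFreq + 1)) and queryNorm = 1 / sqrt(sum of squared clause weights), with the constant-score range clause entering the sum with its boost of 1.0. The class and method names are made up for illustration and are not Lucene API.

```java
// Hand computation of the explanation above; names are illustrative, not Lucene API.
final class QueryNormSketch {

    // Assumed DefaultSimilarity idf: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // Assumed queryNorm: 1 / sqrt(sum of squared clause weights).
    static double queryNorm(double... clauseWeights) {
        double sumOfSquares = 0.0;
        for (double w : clauseWeights) {
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    // Term clause score: queryWeight (idf * queryNorm) times fieldWeight (tf * idf * fieldNorm).
    static double termScore(double idf, double queryNorm, double tf, double fieldNorm) {
        return (idf * queryNorm) * (tf * idf * fieldNorm);
    }

    public static void main(String[] args) {
        double idf = idf(1, 13);                          // subject:search, docFreq=1, maxDocs=13
        double norm = queryNorm(idf, 1.0);                // term clause + range clause (boost 1.0)
        double text = termScore(idf, norm, 1.0, 0.5);     // tf=1, fieldNorm=0.5 as in the explanation
        double range = 1.0 * norm;                        // ConstantScoreQuery: boost * queryNorm
        System.out.printf("idf=%.6f queryNorm=%.7f%n", idf, norm);
        System.out.printf("text=%.7f range=%.7f total=%.7f%n", text, range, text + range);
        System.out.printf("range share of total=%.1f%%%n", 100.0 * range / (text + range));
    }
}
```

    The range clause both adds its own term to the sum and shrinks queryNorm for the text clause, because its boost is counted in the sum of squares.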

    In my own (inherited) application I have multiple textual queries, matching against different fields, combined with several NumericRangeQueries. The contributions of the latter to the scores make it hard to control the boosts of the different fields.

    The logical course of action seems to me to be replacing the NumericRangeQueries with filters. This means removing the NumericRangeQueries from the overall BooleanQuery and separately building a filter that combines the corresponding NumericRangeFilters. The options I see are:
    - Use BooleanFilter
    - Use ChainFilter
    - In order to change as little code as possible, keep the code that combines all NumericRangeQueries into a BooleanQuery, and wrap that in a QueryWrapperFilter.
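
    Conceptually, all three options intersect the sets of documents accepted by the individual range filters and leave the text scores untouched. A minimal, Lucene-free sketch of that idea follows, assuming each filter can be represented as a java.util.BitSet over document IDs. The class, the method names, and the sample data in main are hypothetical and only illustrate the behaviour; they are not the Lucene API.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual model of combining filters with AND; not Lucene code.
final class FilterCombinationSketch {

    // A document passes only if every filter accepts it. Requires at least one filter.
    static BitSet and(BitSet... filters) {
        BitSet result = (BitSet) filters[0].clone();
        for (int i = 1; i < filters.length; i++) {
            result.and(filters[i]);
        }
        return result;
    }

    // Drop documents the combined filter rejects; surviving scores are not modified.
    static Map<Integer, Double> applyFilter(Map<Integer, Double> textScores, BitSet allowed) {
        Map<Integer, Double> hits = new LinkedHashMap<>();
        for (Map.Entry<Integer, Double> e : textScores.entrySet()) {
            if (allowed.get(e.getKey())) {
                hits.put(e.getKey(), e.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical data: two range filters and text scores keyed by doc ID.
        BitSet pubMonth = new BitSet();
        pubMonth.set(1); pubMonth.set(2); pubMonth.set(4);
        BitSet otherRange = new BitSet();
        otherRange.set(2); otherRange.set(3); otherRange.set(4);

        Map<Integer, Double> textScores = new LinkedHashMap<>();
        textScores.put(1, 0.9);
        textScores.put(2, 0.5);
        textScores.put(5, 0.7);

        System.out.println(applyFilter(textScores, and(pubMonth, otherRange)));
    }
}
```

    In this model the filter only decides membership, so field boosts on the text clauses are the only thing that shapes the score.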

    Q1: Are there any advantages or disadvantages (in performance, for instance) to each of these options?
    Q2: Are there any plans to improve Lucene so that it deals in a principled way with this issue of combining TermQueries and NumericRangeQueries?

    Thanks!

  2. #2
    gmuresan

    Say Yes to Filtering

    ...
    I've read more forum discussions on this issue, and some people point out (as LIA 2nd ed., p. 183, does) that using a filter reduces the number of documents under consideration and affects IDF and therefore the overall score. Moreover, the recommendation in such discussions is that MUST BooleanClauses are preferred to Filters unless a large performance gain can be obtained via CachingWrapperFilter.

    This doesn't quite make sense to me: the number of documents in the collection, the size of the vocabulary, the size of each posting list and the IDF of each term are known after indexing and should not be affected by filtering.

    To test this, I further modified the same LIA example and compared the use of a BooleanClause and the use of a Filter:

    Q = category:/technology/computers/programming/methodology category:/philosophy/eastern +pubmonth:[200501 TO 201012]
    ----------
    Tao Te Ching ???
    1.4739084 = (MATCH) product of:
      2.2108626 = (MATCH) sum of:
        1.9717792 = (MATCH) weight(category:/philosophy/eastern in 4), product of:
          0.68659997 = queryWeight(category:/philosophy/eastern), product of:
            2.871802 = idf(docFreq=1, maxDocs=13)
            0.23908332 = queryNorm
          2.871802 = (MATCH) fieldWeight(category:/philosophy/eastern in 4), product of:
            1.0 = tf(termFreq(category:/philosophy/eastern)=1)
            2.871802 = idf(docFreq=1, maxDocs=13)
            1.0 = fieldNorm(field=category, doc=4)
        0.23908332 = (MATCH) ConstantScoreQuery(pubmonth:[200501 TO 201012]), product of:
          1.0 = boost
          0.23908332 = queryNorm
      0.6666667 = coord(2/3)

    Q = +(category:/technology/computers/programming/methodology category:/philosophy/eastern) +pubmonth:[200501 TO 201012]
    ----------
    Tao Te Ching ???
    1.224973 = (MATCH) sum of:
      0.9858896 = (MATCH) product of:
        1.9717792 = (MATCH) sum of:
          1.9717792 = (MATCH) weight(category:/philosophy/eastern in 4), product of:
            0.68659997 = queryWeight(category:/philosophy/eastern), product of:
              2.871802 = idf(docFreq=1, maxDocs=13)
              0.23908332 = queryNorm
            2.871802 = (MATCH) fieldWeight(category:/philosophy/eastern in 4), product of:
              1.0 = tf(termFreq(category:/philosophy/eastern)=1)
              2.871802 = idf(docFreq=1, maxDocs=13)
              1.0 = fieldNorm(field=category, doc=4)
        0.5 = coord(1/2)
      0.23908332 = (MATCH) ConstantScoreQuery(pubmonth:[200501 TO 201012]), product of:
        1.0 = boost
        0.23908332 = queryNorm

    Q = category:/technology/computers/programming/methodology category:/philosophy/eastern
    Date = pubmonth:[200501 TO 201112]
    ----------
    Tao Te Ching ???
    1.0153353 = (MATCH) product of:
      2.0306706 = (MATCH) sum of:
        2.0306706 = (MATCH) weight(category:/philosophy/eastern in 4), product of:
          0.70710677 = queryWeight(category:/philosophy/eastern), product of:
            2.871802 = idf(docFreq=1, maxDocs=13)
            0.24622406 = queryNorm
          2.871802 = (MATCH) fieldWeight(category:/philosophy/eastern in 4), product of:
            1.0 = tf(termFreq(category:/philosophy/eastern)=1)
            2.871802 = idf(docFreq=1, maxDocs=13)
            1.0 = fieldNorm(field=category, doc=4)
      0.5 = coord(1/2)

    Comparing the results, I see that:
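    - maxDocs and IDF are the same in all three cases (maxDocs=13, idf=2.871802);
    - queryNorm and coord can differ: queryNorm is 0.2391 with the BooleanClause and 0.2462 with the Filter, and coord is 2/3 for the first query but 1/2 for the other two. The correct values are the ones obtained when using the Filter; BooleanClauses introduce artificial query terms that affect these metrics;
    - the BooleanClause also introduces a ConstantScoreQuery that further distorts the "true" score.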
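
    The three scores can be recomputed to check where the differences come from. This is a rough sketch under the same assumed DefaultSimilarity formulas as in the explanations (idf = 1 + ln(maxDocs / (docFreq + 1)), queryNorm = 1 / sqrt(sum of squared clause weights)). The idf of the non-matching methodology term does not appear in the output; the sketch assumes docFreq=1 for it, which reproduces the reported queryNorm values. Class and method names are illustrative only.

```java
// Recomputes the three "Tao Te Ching" scores; names are illustrative, not Lucene API.
final class ClauseVsFilterSketch {

    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    static double queryNorm(double... clauseWeights) {
        double sumOfSquares = 0.0;
        for (double w : clauseWeights) {
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    // category:/philosophy/eastern on doc 4: queryWeight * fieldWeight with tf=1, fieldNorm=1.
    static double termScore(double idf, double norm) {
        return (idf * norm) * (1.0 * idf * 1.0);
    }

    // Query 1: two optional terms and the range as a MUST clause in one BooleanQuery, coord(2/3).
    static double flatClauseScore() {
        double idf = idf(1, 13);
        double norm = queryNorm(idf, idf, 1.0); // assumption: both category terms have docFreq=1
        return (termScore(idf, norm) + 1.0 * norm) * (2.0 / 3.0);
    }

    // Query 2: +(term term) +range; coord(1/2) applies to the inner BooleanQuery only.
    static double nestedClauseScore() {
        double idf = idf(1, 13);
        double norm = queryNorm(idf, idf, 1.0);
        return termScore(idf, norm) * 0.5 + 1.0 * norm;
    }

    // Query 3: two optional terms, range applied as a Filter; no range weight in queryNorm.
    static double filteredScore() {
        double idf = idf(1, 13);
        double norm = queryNorm(idf, idf);
        return termScore(idf, norm) * 0.5;
    }

    public static void main(String[] args) {
        System.out.printf("flat clause:   %.7f%n", flatClauseScore());   // thread reports 1.4739084
        System.out.printf("nested clause: %.7f%n", nestedClauseScore()); // thread reports 1.224973
        System.out.printf("filter:        %.7f%n", filteredScore());     // thread reports 1.0153353
    }
}
```

    Only the third variant depends solely on the text clauses: the range weight is absent from the sum of squares, it adds no constant to the score, and it does not count toward coord.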

    I would conclude that from the perspective of obtaining "true" scores, using Filter is preferred to using MUST BooleanClause in a BooleanQuery.

    The TF-IDF model (as well as other IR models) was developed for text-like features. The assumptions made in that model do not apply to numeric fields such as date or longitude/latitude, appropriate for faceted filtering, so the two models should not be mixed in a common query.

    Q3: Considering that all the expert opinions I've read in forums speak against filtering, is there something that I'm missing?
