Lucene score - confidence
I'm using Lucene to retrieve relevant segments of a corpus for a given
query. Each segment is indexed as a Document. Once the relevant segments
are retrieved, I search them with a regex to capture the requested
information. The regex may match in two or more of the documents
retrieved by Lucene, so I would like to attach a confidence score to
each captured match. Put simply, if the regex matches in the first
retrieved document and in the fourth, the match from the first document
is more likely to be relevant, since its document was scored higher
than the other one.
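To make the setup concrete, here is a minimal Python sketch of what I mean.
My real code uses Lucene; the `hits` list, its scores, the segment texts,
and the date regex are all made up for illustration.

```python
import re

# Hypothetical ranked hits as (raw_score, segment_text) pairs, best first.
# Scores and texts are invented; in the real setup they come from Lucene.
hits = [
    (7.3, "The contract was signed on 2012-03-14 in Paris."),
    (5.1, "No date is mentioned in this segment."),
    (4.8, "General terms and conditions."),
    (2.2, "An earlier draft was dated 2011-11-02."),
]

# Illustrative regex: ISO-style dates.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")


def captures_with_rank(hits, pattern):
    """Return (captured_text, rank, raw_score) for every regex match."""
    results = []
    for rank, (score, text) in enumerate(hits, start=1):
        for match in pattern.finditer(text):
            results.append((match.group(0), rank, score))
    return results


captures = captures_with_rank(hits, date_pattern)
for text, rank, score in captures:
    print(f"{text} found at rank {rank}, raw score {score}")
```

The capture from rank 1 should get a higher confidence than the one from
rank 4, but the raw scores attached to them (7.3 and 2.2 here) are not on
any fixed scale.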
To show a confidence score, I tried to use the scores returned by
Lucene. But Lucene scores are on an arbitrary scale and are not
normalized, so they are not directly usable as confidence values. I
wonder how a score can be greater than 1 when the default scoring
system in Lucene is based on the cosine measure. In the simple case, the
cosine similarity between two tf-idf document vectors is between 0 and
1.
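Here is a small Python sketch of what I understand by the cosine measure.
It is only my reading of the textbook formula, not Lucene's actual scoring
code, and the tf-idf weights are invented.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Made-up, non-negative tf-idf weights over the same four terms.
query_vec = [1.2, 0.0, 2.5, 0.7]
doc_vec = [0.9, 1.4, 3.1, 0.0]

print("cosine:", cosine(query_vec, doc_vec))
print("dot product:", dot(query_vec, doc_vec))
```

With non-negative weights the cosine stays between 0 and 1, while the
plain dot product of the same vectors is well above 1. My guess, which I
have not confirmed, is that Lucene's score does not apply the full length
normalization, which would explain values greater than 1.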
Since the Lucene score is not normalized, how can I compare the scores
of the top-ranked documents from different queries in order to decide
which query is the most relevant for retrieving a document? Is there
any way to assign each retrieved document a confidence score that makes
results from different searches, run with different queries,
comparable?
Many thanks.