lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hassan Saneifar <>
Subject Hit score and confidence ratio in results
Date Wed, 01 Dec 2010 12:56:27 GMT
I'm using lucene to retrieve relevant segments of a corpus based on a
given query. Every segment is represented as Document in the indexing.
Once the relevant segments are retrieved, I search for a Regex in them
to capture the requested information. It can happens that I found the
regex in two or more documents retrieved by lucene. Thus, I would like
to show a confidence score to each captured Regex. To make simple, I
means if the regex is found in the first retrieved document and in
fourth, the one retrieved in the first document is more certain to be
relevant since the its document was better scored than other one.

To show a confidence score, I tried to use scores returned by Lucene.
But, the lucene score are arbitrary and not normalized. So it's not
relevant to use. I wonder how we can have an arbitrary score (>1) when
the default scoring system in lucene is based on cosine measure. In a
simple case, the cosine score between two document vectors (obtained by
tf-idf) is between 0 and 1.

Since the lucene score is arbitrary and not normalized, if I want to
verify which query is more relevant to retrieve a document, how can I
compare the score of the first retrieved documents corresponding to each
query ? Is there any way to give a confidence score to each retrieved
document which make the comparison of the results of the different
search using the different queries possible?

Many thank
Best regards.

Hassan Saneifar, PhD. student,
University Montpellier II
LIRMM laboratory
161 rue Ada 34392 Montpellier, France
Tel: +33 (0)467 41 85 41
Fax: +33 (0)467 41 85 00		
Hassan Saneifar, R&D engineer
Satin IP Technologies
Tel: +33 (0)467 13 00 86

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message