lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Hit score and confidence ratio in results
Date Thu, 02 Dec 2010 01:27:43 GMT
If I understand your problem space, you're trying to compare scores
across two different queries. Don't do this; it is meaningless. Scores
are only valid for comparing documents' relevance within a single
query. How would one compare scores between two queries like
  text:nonsense
  name:terrible
? Different fields, different values. What would it mean for a doc
matching the first query to score higher or lower than the first doc
matching the second query?
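
To make that concrete, here is a minimal sketch in plain Java (no Lucene
dependency). The class RelativeScores, the method relativeToTop and the
score values are all made up for illustration; none of this is a Lucene
API. It rescales one query's hit scores so the top hit becomes 1.0, which
is about as far as a raw score can be pushed: a rank-relative number
inside one result list, not a probability or a confidence.

```java
// Hypothetical helper, not part of Lucene: rescales the scores of ONE
// query's hits relative to that query's best hit.
class RelativeScores {

    // Divides every score by the largest score in the same result list.
    static double[] relativeToTop(double[] scores) {
        double max = 0.0;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double[] out = new double[scores.length];
        if (max <= 0.0) {
            return out; // no positive scores: leave everything at 0.0
        }
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] / max;
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up raw scores for two unrelated queries.
        double[] queryA = {7.2, 3.6, 1.8};
        double[] queryB = {0.4, 0.2};
        System.out.println(java.util.Arrays.toString(relativeToTop(queryA)));
        System.out.println(java.util.Arrays.toString(relativeToTop(queryB)));
    }
}
```

Note that the top hit of both queries comes out as 1.0, even though the
raw scores differ by more than an order of magnitude. The rescaled value
says nothing about which query matched "better", so it still cannot be
compared across queries.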

As for how scores can be > 1, have you looked at the scoring
formula here?
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
and here?
http://lucene.apache.org/java/2_4_0/scoring.html#Understanding the Scoring Formula

Don't forget that there can also be boosts...
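
A rough illustration of why the number is not bounded by 1, in plain Java
with no Lucene dependency. This is a toy model, not Lucene's code: the
class ToyScoring and its methods are hypothetical, the tf and idf shapes
(square root of the frequency, 1 + ln(numDocs / (docFreq + 1))) are my
reading of the default Similarity, and the length and query normalization
factors are deliberately left out. The input numbers are invented.

```java
// Toy model only: a simplified tf-idf term weight in the spirit of the
// formula in the Similarity javadoc, NOT Lucene's actual implementation.
class ToyScoring {

    // Assumed idf shape: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int numDocs, int docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // One term's contribution: sqrt(freq) * idf^2 * boost, with no
    // length normalization and no query normalization applied.
    static double termScore(int freq, int numDocs, int docFreq, double boost) {
        double w = idf(numDocs, docFreq);
        return Math.sqrt(freq) * w * w * boost;
    }

    // Textbook cosine similarity; for non-negative vectors the result
    // lies between 0 and 1 (up to floating-point rounding).
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // A term occurring 4 times in a doc, in 9 of 1000 docs.
        System.out.println("no boost:  " + termScore(4, 1000, 9, 1.0));
        System.out.println("boost 2.0: " + termScore(4, 1000, 9, 2.0));
        // Cosine of two small made-up tf-idf vectors, for contrast.
        System.out.println("cosine:    "
                + cosine(new double[]{2, 0, 1}, new double[]{1, 1, 0}));
    }
}
```

The term contribution is already well above 1 before any boost, and a
boost simply multiplies it further, while the textbook cosine stays
bounded. Only a fully normalized dot product has the 0 to 1 range.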

Best
Erick

On Wed, Dec 1, 2010 at 7:56 AM, Hassan Saneifar <hassan.saneifar@lirmm.fr> wrote:

> Hello,
> I'm using Lucene to retrieve relevant segments of a corpus based on a
> given query. Every segment is represented as a Document in the index.
> Once the relevant segments are retrieved, I search them for a regex
> to capture the requested information. It can happen that I find the
> regex in two or more documents retrieved by Lucene. Thus, I would like
> to attach a confidence score to each captured match. Put simply, if
> the regex is found in the first retrieved document and in the fourth,
> the match from the first document is more likely to be relevant, since
> its document was scored higher than the other one.
>
> To show a confidence score, I tried to use the scores returned by Lucene.
> But the Lucene scores are arbitrary and not normalized, so they are not
> suitable for this. I wonder how we can get an arbitrary score (> 1) when
> the default scoring system in Lucene is based on the cosine measure. In
> the simple case, the cosine score between two document vectors (obtained
> by tf-idf) is between 0 and 1.
>
> Since the Lucene score is arbitrary and not normalized, if I want to
> determine which query is more relevant for retrieving a document, how can
> I compare the scores of the top documents retrieved by each query? Is
> there any way to give each retrieved document a confidence score that
> makes it possible to compare the results of different searches using
> different queries?
>
> Many thanks.
> Best regards.
>
> --
> Hassan Saneifar, PhD. student,
> University Montpellier II
> LIRMM laboratory
> 161 rue Ada 34392 Montpellier, France
> Tel: +33 (0)467 41 85 41
> Fax: +33 (0)467 41 85 00
> URL: http://www.lirmm.fr/~saneifar
>
> --
> Hassan Saneifar, R&D engineer
> Satin IP Technologies
> Tel: +33 (0)467 13 00 86
> URL: http://www.satin-ip.com
>
