lucene-java-user mailing list archives

From "Karsten Konrad" <Karsten.Kon...@xtramind.com>
Subject AW: Probabilistic Model in Lucene - possible?
Date Thu, 04 Dec 2003 10:09:13 GMT

Hi Herb,

thank you for your insights.

>>
but by most accepted definitions, the tf/idf model in Lucene is a probabilistic model. 
>>

Can you send some pointers to help me understand that? Are all TF/IDF-variants 
probabilistic models? If so, what makes any model a non-probabilistic one? 
If you claim that TF/IDF is probabilistic, then the plain cosine (an extreme 
form of TF/IDF, with IDF for all terms being considered constant) of VSM would
also be a probabilistic model.
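
To make the "constant IDF" remark concrete, here is a minimal sketch (the class name, method names, and numbers are made up for illustration; this is not Lucene code). Multiplying every term weight by the same constant cancels out in the cosine, so TF/IDF with a constant IDF ranks exactly like the plain cosine on term frequencies:

```java
// Illustrative sketch only; names and values are hypothetical, not Lucene's API.
class CosineSketch {

    // Cosine similarity between two equal-length weight vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Multiplies each term frequency by the IDF of the corresponding term.
    static double[] tfIdf(double[] tf, double[] idf) {
        double[] weighted = new double[tf.length];
        for (int i = 0; i < tf.length; i++) {
            weighted[i] = tf[i] * idf[i];
        }
        return weighted;
    }

    public static void main(String[] args) {
        double[] queryTf = {1, 0, 2};
        double[] docTf = {3, 1, 1};
        double[] constantIdf = {2.5, 2.5, 2.5}; // same IDF for every term

        double plain = cosine(queryTf, docTf);
        double weighted = cosine(tfIdf(queryTf, constantIdf), tfIdf(docTf, constantIdf));
        // The two values agree up to floating-point rounding.
        System.out.println(plain + " vs " + weighted);
    }
}
```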

>>
it's got strange normalizations though that don't allow comparisons of rank values across
queries.
>>

Lucene's internal ranking sometimes returns values > 1.0; these are then normalized to 1.0,
and the other scores are adjusted accordingly. I have nothing against this (it's a hack, but
a useful one), but it makes comparing rank values across queries really difficult. It's like
using a different scale each time you measure something different, and not telling anyone
about it.
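
A rough sketch of the effect, assuming the rescaling simply divides every score by the top score whenever that top score exceeds 1.0 (the class and method names are hypothetical and this is not Lucene's source code):

```java
// Hypothetical sketch of per-query score rescaling; not Lucene's source code.
class ScoreNormalizationSketch {

    // Rescales raw scores (sorted in descending order) so that the top score
    // is at most 1.0. Scores are left untouched if the top score is <= 1.0.
    static float[] normalize(float[] rawScores) {
        float[] out = rawScores.clone();
        if (out.length == 0 || out[0] <= 1.0f) {
            return out; // nothing to rescale
        }
        float norm = 1.0f / out[0];
        for (int i = 0; i < out.length; i++) {
            out[i] *= norm;
        }
        return out;
    }

    public static void main(String[] args) {
        // Query A has a raw top score above 1.0, so all its scores are rescaled.
        float[] queryA = normalize(new float[] {4.0f, 2.0f, 1.0f});
        // Query B stays below 1.0, so its scores are reported unchanged.
        float[] queryB = normalize(new float[] {0.5f, 0.25f});
        // The same reported value can hide very different raw scores.
        System.out.println("A, second hit: " + queryA[1] + " (raw 2.0)");
        System.out.println("B, first hit:  " + queryB[0] + " (raw 0.5)");
    }
}
```

The scale factor depends on the query's own top hit, which is why a reported value from one query says nothing about a reported value from another.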

>>
it isn't terribly hard to make a normalized probabilistic model that allows comparing of document
scores across queries and assign a meaning to the score. i've done it.
>>

Stop bragging, send us your Similarity implementation :)

Regards,

Karsten


-----Original Message-----
From: Chong, Herb [mailto:HChong3@bloomberg.com]
Sent: Wednesday, December 3, 2003 23:01
To: Lucene Users List
Subject: RE: Probabilistic Model in Lucene - possible?


i think i am missing the original question, but by most accepted definitions, the tf/idf model
in Lucene is a probabilistic model. it's got strange normalizations though that don't allow
comparisons of rank values across queries.

it isn't terribly hard to make a normalized probabilistic model that allows comparing of document
scores across queries and assign a meaning to the score. i've done it. however, that means
abandoning idf and keeping actual term frequencies for each document and document size. once
you normalize this way, you can intermingle document scores from different queries and different
corpora and make statements about the absolute value of the score. it also leads directly
into the discussion we had earlier about interterm correlations and how to handle them properly
since the full interterm probabilistic model has as a special case the traditional tf/idf
model. interjecting Boolean conditions and boost makes the model much more complicated.
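
the message gives no formula, so the following is only one possible reading, not the model described above: a unigram query-likelihood score built from raw term frequencies and document length, with no idf. the add-one smoothing, the vocabularySize parameter, the geometric mean over query terms, and all names are assumptions made for the sake of a runnable sketch.

```java
import java.util.Map;

// Hypothetical sketch: length-normalized term-frequency scoring without idf.
// This is an assumed stand-in, not the model described in the message.
class QueryLikelihoodSketch {

    // Estimate of P(term | document) with add-one smoothing.
    // vocabularySize is an assumed parameter (number of distinct terms).
    static double termProbability(int termFreq, int docLength, int vocabularySize) {
        return (termFreq + 1.0) / (docLength + (double) vocabularySize);
    }

    // Geometric mean of the per-term probabilities, so the result stays
    // strictly between 0 and 1 regardless of how many terms the query has.
    static double score(String[] queryTerms, Map<String, Integer> docTermFreqs,
                        int docLength, int vocabularySize) {
        if (queryTerms.length == 0) {
            return 0.0;
        }
        double logSum = 0.0;
        for (String term : queryTerms) {
            int tf = docTermFreqs.getOrDefault(term, 0);
            logSum += Math.log(termProbability(tf, docLength, vocabularySize));
        }
        return Math.exp(logSum / queryTerms.length);
    }
}
```

because the score depends only on the document's own term counts and length, it does not change with the rest of the collection the way an idf-weighted score does.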

Herb....

-----Original Message-----
From: Karsten Konrad [mailto:Karsten.Konrad@xtramind.com]
Sent: Wednesday, December 03, 2003 4:51 PM
To: Lucene Users List
Subject: AW: Probabilistic Model in Lucene - possible?

>>
I would highly appreciate it if the experts here (especially Karsten or
Chong) look at my idea and tell me if this would be possible.
>>

Sorry, I have no idea about how to use a probabilistic approach with 
Lucene, but if anyone does so, I would like to know, too. 

I am currently puzzled by a related question: I would like to know if there are any approaches
to get a confidence value for relevance rather than a ranking. I.e., it would be nice to have
a ranking weight whose value has some kind of semantics such that we could compare results
from different queries. Can probabilistic approaches do anything like this?

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

