not all tf/idf variants are probabilistic models, but a great many are if the term weights are probabilities. if we just take straight, unmodified Term Frequency in a document, Inverse Document Frequency in the corpus, and the Term Frequency in the query as 1, you are in fact comparing the statistical properties of the query against the statistical properties of the document. they are probabilities you are comparing. i can't think of many papers that come right out and say it, but if you look at an individual term weight and can interpret it as a genuine probability, the vector space model based on those weights is a probabilistic model. the derivation that shows it is relatively straightforward if you have the right general model to start with. once you start throwing in ad hoc normalizations, things get out of whack and it's no longer a probabilistic model.
the implementations that i have done were for a former company, and that means they are secret and protected by various intellectual property rights. however, i can sketch here the general approach one has to take and an outline of the derivation that unifies probabilistic models with vector space models and at the same time incorporates pairwise interterm correlation. in fact, the pairwise interterm correlations are a fundamental assumption. once you do all this, you can show that the traditional vector space model is a special case of a pairwise interterm correlation model. for those who are interested in advanced matrix algebra and some basic statistics, it should be very interesting. if only i had a published paper, i would post it. unfortunately, what i have is very obscure because it's protected. the only paper i started was submitted to SIGIR but rejected by all but one referee. that one thought this was a tremendous unification of the two methods, but academic journals being what they are, when 4 out of 5 referees can't understand the paper, it doesn't get published. i may dust it off and enlarge it into a much longer paper for the Journal of IR, but once again, unless you are comfortable with probability theory and matrix theory, you are not going to follow it.
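as a rough illustration of the "special case" claim (this is not the protected derivation, only a generic sketch): one common way to write a score with pairwise term correlations is the bilinear form q^T C d, where C is a term-by-term correlation matrix. when C is the identity matrix, the form collapses to the ordinary vector space dot product. the function names, the toy vectors, and the correlation values below are all made up for the example.

```python
def correlated_score(query, doc, corr):
    """Bilinear score: sum over i, j of query[i] * corr[i][j] * doc[j]."""
    return sum(
        query[i] * corr[i][j] * doc[j]
        for i in range(len(query))
        for j in range(len(doc))
    )


def dot(query, doc):
    """Ordinary vector space dot product."""
    return sum(q * d for q, d in zip(query, doc))


# Toy 3-term vocabulary; the weights are illustrative only.
query = [1.0, 0.0, 1.0]
doc = [0.5, 0.25, 0.0]

# No interterm correlation: every term is related only to itself.
identity = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical correlations: the terms at index 1 and index 2 are related.
correlated = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]

plain = correlated_score(query, doc, identity)        # same as dot(query, doc)
with_corr = correlated_score(query, doc, correlated)  # credits the related term

print(plain, with_corr)
```

with the identity matrix the document gets credit only for the query term it literally contains; with the hypothetical correlations it also gets partial credit for a term related to the second query term.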
so, who is game for a tutorial on the derivation?
Herb...
-----Original Message-----
From: Karsten Konrad [mailto:Karsten.Konrad@xtramind.com]
Sent: Thursday, December 04, 2003 5:09 AM
To: Lucene Users List
Subject: AW: Probabilistic Model in Lucene - possible?
Hi Herb,
thank you for your insights.
>>
but by most accepted definitions, the tf/idf model in Lucene is a probabilistic model.
>>
Can you send some pointers to help me understand that? Are all TF/IDF-variants
probabilistic models? If so, what makes any model a non-probabilistic one?
If you claim that TF/IDF is probabilistic, then the plain cosine (an extreme
form of TF/IDF, with IDF for all terms being considered constant) of VSM would
also be a probabilistic model.
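A small sketch of why the plain cosine can be read as TF/IDF with a constant IDF: multiplying every term weight by the same constant leaves the cosine unchanged. The vectors and the constant are arbitrary values chosen for the example, and the helper name `cosine` is my own.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Raw term frequencies over a toy 3-term vocabulary.
query_tf = [1, 0, 1]
doc_tf = [3, 2, 0]

# The same IDF applied to every term (arbitrary value).
constant_idf = 2.5
weighted_query = [tf * constant_idf for tf in query_tf]
weighted_doc = [tf * constant_idf for tf in doc_tf]

plain = cosine(query_tf, doc_tf)
weighted = cosine(weighted_query, weighted_doc)

print(plain, weighted)  # the two values agree up to rounding
```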
>>
it's got strange normalizations though that don't allow comparisons of rank values across queries.
>>
Lucene's internal ranking sometimes returns values > 1.0, these are then normalized to 1.0,
adjusting other rankings accordingly. While I have nothing to say against this - it's a hack,
but useful - it makes comparing the rank values across queries really difficult. It's like
using different scales whenever you measure something different, and then you do not tell
anyone about it.
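To make the scale problem concrete, here is a minimal sketch of the behaviour as described in this message. It is not Lucene's actual code; the function name `normalize_hits` and the raw scores are invented for the illustration.

```python
def normalize_hits(raw_scores):
    """Rescale so the top score is 1.0, but only when it exceeds 1.0."""
    if not raw_scores:
        return []
    top = max(raw_scores)
    if top <= 1.0:
        return list(raw_scores)  # already within range, left untouched
    return [score / top for score in raw_scores]


# Two hypothetical queries with very different raw score ranges.
query_a = normalize_hits([4.0, 2.0, 1.0])  # every score divided by 4.0
query_b = normalize_hits([0.8, 0.4, 0.2])  # unchanged

print(query_a)
print(query_b)
```

After normalization, the second hit of the first query (raw 2.0) and the second hit of the second query (raw 0.4) end up close together, although each query was silently measured on its own scale.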
>>
it isn't terribly hard to make a normalized probabilistic model that allows comparing document scores across queries and assigns a meaning to the score. i've done it.
>>
Stop bragging, send us your Similarity implementation :)
Regards,
Karsten
-----Original Message-----
From: Chong, Herb [mailto:HChong3@bloomberg.com]
Sent: Wednesday, December 03, 2003 11:01 PM
To: Lucene Users List
Subject: RE: Probabilistic Model in Lucene - possible?
i think i am missing the original question, but by most accepted definitions, the tf/idf model in Lucene is a probabilistic model. it's got strange normalizations though that don't allow comparisons of rank values across queries.
it isn't terribly hard to make a normalized probabilistic model that allows comparing document scores across queries and assigns a meaning to the score. i've done it. however, that means abandoning idf and keeping the actual term frequencies and the document size for each document. once you normalize this way, you can intermingle document scores from different queries and different corpora and make statements about the absolute value of the score. it also leads directly into the discussion we had earlier about interterm correlations and how to handle them properly, since the full interterm probabilistic model has the traditional tf/idf model as a special case. interjecting Boolean conditions and boosts makes the model much more complicated.
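one possible reading of "no idf, actual term frequencies plus document size", offered only as a sketch and not as the implementation mentioned above: estimate P(term | doc) as tf divided by document length, and score a document by the geometric mean of those probabilities over the query terms. the geometric mean, the lack of smoothing, and the names `term_probabilities` and `score` are assumptions of this example.

```python
import math
from collections import Counter


def term_probabilities(tokens):
    """Maximum-likelihood estimate P(term | doc) = tf / document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: tf / total for term, tf in counts.items()}


def score(query_terms, doc_tokens):
    """Geometric mean of P(term | doc) over the query terms, in [0, 1]."""
    if not query_terms or not doc_tokens:
        return 0.0
    probs = term_probabilities(doc_tokens)
    log_sum = 0.0
    for term in query_terms:
        p = probs.get(term, 0.0)
        if p == 0.0:
            return 0.0  # unsmoothed: one missing term zeroes the score
        log_sum += math.log(p)
    return math.exp(log_sum / len(query_terms))


# Toy document of 10 tokens; "lucene" and "scores" each occur twice.
doc = "lucene scores documents with tf idf and lucene normalizes scores".split()

one_term = score(["lucene"], doc)
two_terms = score(["lucene", "scores"], doc)
missing = score(["lucene", "bm25"], doc)

print(one_term, two_terms, missing)
```

because every factor is a probability and the mean is taken per query term, a one-term and a two-term query land on the same 0-to-1 scale here. whether that matches the normalization described above is an assumption.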
Herb....
-----Original Message-----
From: Karsten Konrad [mailto:Karsten.Konrad@xtramind.com]
Sent: Wednesday, December 03, 2003 4:51 PM
To: Lucene Users List
Subject: AW: Probabilistic Model in Lucene - possible?
>>
I would highly appreciate it if the experts here (especially Karsten or
Chong) look at my idea and tell me if this would be possible.
>>
Sorry, I have no idea about how to use a probabilistic approach with
Lucene, but if anyone does so, I would like to know, too.
I am currently puzzled by a related question: I would like to know if there are any approaches to get a confidence value for relevance
rather than a ranking. I.e., it would be nice to have a ranking
weight whose value has some kind of semantics such that we could
compare results from different queries. Can probabilistic approaches
do anything like this?
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org