not all tf/idf variants are probabilistic models, but a great many are if the term weights
are probabilities. if you take straight, unmodified term frequency in a document, inverse
document frequency in the corpus, and the term frequency in the query as 1, you are in fact
comparing the statistical properties of the query against the statistical properties of the
document. they are probabilities you are comparing. i can't think of many papers that come right
out and say it, but if you can look at an individual term weight and interpret it as a genuine
probability, the vector space model based on those weights is a probabilistic model. the derivation
is relatively straightforward, provided you start from the right general model.
once you start throwing in ad hoc normalizations, things get out of whack and it's no
longer a probabilistic model.
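
as a hedged illustration of the kind of reading meant here (a standard textbook interpretation, not the unpublished derivation discussed in this thread): let N be the number of documents in the corpus, df_t the number of documents containing term t, and tf_{t,d} the raw count of t in document d. these symbols are assumed notation. with the query term frequency fixed at 1, the raw tf/idf score is a negative log-probability:

```latex
\[
P(t) = \frac{\mathrm{df}_t}{N}, \qquad
\mathrm{idf}_t = \log\frac{N}{\mathrm{df}_t} = -\log P(t)
\]
\[
\sum_{t \in q} \mathrm{tf}_{t,d}\,\mathrm{idf}_t
  = -\log \prod_{t \in q} P(t)^{\mathrm{tf}_{t,d}}
\]
```

under the strong assumption that term occurrences are independent, a higher score means the observed query-term occurrences in d were less likely to arise by chance. any ad hoc factor multiplied in afterwards breaks this identity.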
the implementations i have done were for a former company, which means they are secret and protected
by various intellectual property rights. however, i can sketch here the general approach one
has to take and an outline of the derivation that unifies probabilistic models with vector
space models and at the same time incorporates pairwise interterm correlation. in fact, the
pairwise interterm correlations are a fundamental assumption. once you do all this, you can
show that the traditional vector space model is a special case of a pairwise interterm correlation
model. for those who are interested in advanced matrix algebra and some basic statistics,
it should be very interesting. if only i had a published paper, i would post it. unfortunately,
what i have is very obscure because it's protected. the only paper i started was submitted
to SIGIR but rejected by all but one referee. that one thought it was a tremendous unification
of the two methods, but academic journals being what they are, when 4 out of 5 referees can't
understand the paper, it doesn't get published. i may dust it off and expand it into a much
longer paper for the Journal of IR, but once again, unless you are comfortable with probability
theory and matrix theory, you are not going to follow it.
so, who is game for a tutorial on the derivation?
Herb...
Original Message
From: Karsten Konrad [mailto:Karsten.Konrad@xtramind.com]
Sent: Thursday, December 04, 2003 5:09 AM
To: Lucene Users List
Subject: AW: Probabilistic Model in Lucene - possible?
Hi Herb,
thank you for your insights.
>>
but by most accepted definitions, the tf/idf model in Lucene is a probabilistic model.
>>
Can you send some pointers to help me understand that? Are all TF/IDF variants
probabilistic models? If so, what makes any model a non-probabilistic one?
If you claim that TF/IDF is probabilistic, then the plain cosine of VSM (an extreme
form of TF/IDF, with the IDF of all terms considered constant) would
also be a probabilistic model.
>>
it's got strange normalizations though that don't allow comparisons of rank values across
queries.
>>
Lucene's internal ranking sometimes returns values > 1.0; these are then normalized to 1.0,
adjusting the other rankings accordingly. While I have nothing to say against this (it's a hack,
but a useful one), it makes comparing the rank values across queries really difficult. It's like
using a different scale whenever you measure something different, and then not telling
anyone about it.
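
To make the effect concrete, here is a minimal Python sketch of the behaviour described above. It is not Lucene's actual code; the function name `normalize_hits` and the raw scores are made up for illustration.

```python
def normalize_hits(raw_scores):
    """Rescale a result list so the top hit is 1.0, but only if it exceeds 1.0."""
    if not raw_scores:
        return []
    top = max(raw_scores)
    # The scale factor depends on the query's own top score,
    # so every query may end up on a different scale.
    scale = 1.0 / top if top > 1.0 else 1.0
    return [s * scale for s in raw_scores]


# Hypothetical raw scores for two different queries.
query_a = normalize_hits([2.0, 1.0, 0.5])  # top score > 1.0, list is rescaled
query_b = normalize_hits([0.8, 0.4, 0.2])  # top score <= 1.0, list is untouched

print(query_a)
print(query_b)
```

In this example the second hit of query A had a higher raw score (1.0) than the best hit of query B (0.8), yet after normalization it is reported lower. The reported values alone no longer tell you which document matched better.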
>>
it isn't terribly hard to make a normalized probabilistic model that allows comparing document
scores across queries and assigns a meaning to the score. i've done it.
>>
Stop bragging, send us your Similarity implementation :)
Regards,
Karsten
Original Message
From: Chong, Herb [mailto:HChong3@bloomberg.com]
Sent: Wednesday, December 03, 2003 11:01 PM
To: Lucene Users List
Subject: RE: Probabilistic Model in Lucene - possible?
i think i am missing the original question, but by most accepted definitions, the tf/idf model
in Lucene is a probabilistic model. it's got strange normalizations though that don't allow
comparisons of rank values across queries.
it isn't terribly hard to make a normalized probabilistic model that allows comparing document
scores across queries and assigns a meaning to the score. i've done it. however, that means
abandoning idf and keeping, for each document, the actual term frequencies and the document size. once
you normalize this way, you can intermingle document scores from different queries and different
corpora and make statements about the absolute value of the score. it also leads directly
into the discussion we had earlier about interterm correlations and how to handle them properly,
since the full interterm probabilistic model has the traditional tf/idf model as a special
case. introducing Boolean conditions and boosts makes the model much more complicated.
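
a minimal sketch of what a score built only from term frequencies and document size can look like. this is not the model described above, which is unpublished; the name `query_mass` and the sample text are hypothetical. the score is the fraction of a document's tokens that are query terms, i.e. the maximum-likelihood probability that a token drawn at random from the document is a query term.

```python
from collections import Counter


def query_mass(doc_tokens, query_terms):
    """Fraction of the document's tokens that are query terms, in [0, 1]."""
    if not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    # Each distinct query term counts once (query term frequency taken as 1);
    # no idf or other corpus-level statistic is involved.
    matched = sum(counts[t] for t in set(query_terms))
    return matched / len(doc_tokens)


# Hypothetical document: 9 tokens, two of which match the query.
doc = "the quick brown fox jumps over the lazy dog".split()
print(query_mass(doc, ["fox", "dog"]))
```

because the value depends only on the document and the query terms, a given score means the same thing whichever query or corpus produced it. it ignores interterm correlation entirely, which is exactly the part the full model is meant to handle.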
Herb....
Original Message
From: Karsten Konrad [mailto:Karsten.Konrad@xtramind.com]
Sent: Wednesday, December 03, 2003 4:51 PM
To: Lucene Users List
Subject: AW: Probabilistic Model in Lucene - possible?
>>
I would highly appreciate it if the experts here (especially Karsten or
Chong) look at my idea and tell me if this would be possible.
>>
Sorry, I have no idea about how to use a probabilistic approach with
Lucene, but if anyone does so, I would like to know, too.
I am currently puzzled by a related question: I would like to know if there are any
approaches that yield a confidence value for relevance rather than a ranking. I.e., it
would be nice to have a ranking weight whose value has some kind of semantics, such
that we could compare results from different queries. Can probabilistic approaches
do anything like this?

To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

