Deal all,
I am interested in implement a probabilistic model in Lucene as well.
I checked the book titled "model information retrieval" authored by Ricardo
Baeza-Yates and Berthier Ribeiro-Neto, it seems to me that the
implementation is not very complicated when we use Lucene's IndexReader
class, almost all the parameters needed are there: the total number of
document in the index (collection), the number of documents having a
particular term, that is it. Probably we need to find out a satisfied method
of defining the weights of terms in the documents as well as in the query.
Cheers,
Shengli
Adam Saltiel said:
> Herb,
> Any one game ... ?
> No takers? I would be very interested, but maybe beyond what can be
> posted in a mail list. I'd be equally interested in any references you
> may have.
> As we are on this subject how does LSI and the similar CNG (context
> network graph) fit into the model used by lucene. Could lucene be
> massaged to implement different mathematical models of search and
> retrieval, if so how modular are the core functions?
>
> Adam Saltiel
>
>
> > -----Original Message-----
> > From: Chong, Herb [mailto:HChong3@bloomberg.com]
> > Sent: Thursday, December 04, 2003 1:53 PM
> > To: Lucene Users List
> > Subject: RE: Probabilistic Model in Lucene - possible?
> >
> > not all tf/idf variants are probabilistic models, but a great many are
> if
> > the term weights are probabilities. if we just take straight,
> unmodified
> > Term Frequency in a document, Inverse Document Frequency in the
> corpus,
> > and the Term Frequency in the query as 1, you are in fact comparing
> the
> > statistical properties of the query against the statistical properties
> of
> > the query. they are probabilities you are comparing. i can't think of
> many
> > papers that come right out and say it, but if you look at an
> individual
> > term weight and can interpret it as a genuine probability, the vector
> > space model based on the weights is a probabilistic model. the
> derivation
> > is relatively straight forward to show it, if you have the right
> general
> > model to start with. once you start throwing in ad hoc normalizations,
> > then things get out of whack and it's not longer a probabilistic
> model.
> >
> > the implementations that i have done are with a former company and
> that
> > means secret and protected by various intellectual property rights.
> > however, i can sketch here the general approach one has to take and an
> > outline of the derivation that unifies probabilistic models with
> vector
> > space models and at the same time incorporate pairwise interterm
> > correlation. in fact, the pairwise interterm correlations are a
> > fundamental assumption. once you do all this, you can show that the
> > traditional vector space model is a special case of a pairwise
> interterm
> > correlation model. for those that are interested in advanced matrix
> > algebra and some basic statistics, it should be very interesting. if
> only
> > i had a published paper, i would post it. unfortunately, what i have
> is
> > very obtuse because it's protected. the only paper that started out
> was
> > submitted to SIGIR but rejected by all but one referee. that one
> thought
> > this was a tremendous unification of the two methods, but academic
> > journals being what they are, when 4 out of 5 referees can't
> understand
> > the paper, it doesn't get published. i may brush it off and enlarge
> into a
> > much longer paper for the Journal of IR, but once again, unless you
> are
> > comfortable with probability theory and matrix theory, you are not
> going
> > to follow it.
> >
> > so, who is game for a tutorial on the derivation?
> >
> > Herb...
> >
> > -----Original Message-----
> > From: Karsten Konrad [mailto:Karsten.Konrad@xtramind.com]
> > Sent: Thursday, December 04, 2003 5:09 AM
> > To: Lucene Users List
> > Subject: AW: Probabilistic Model in Lucene - possible?
> >
> >
> >
> > Hi Herb,
> >
> > thank you for your insights.
> >
> > >>
> > but by most accepted definitions, the tf/idf model in Lucene is a
> > probabilistic model.
> > >>
> >
> > Can you send some pointers to help me understand that? Are all TF/IDF-
> > variants
> > probabilistic models? If so, what makes any model a non-probabilistic
> one?
> > If you claim that TF/IDF is probabilistic, then the plain cosine (an
> > extreme
> > form of TF/IDF, with IDF for all terms being considered constant) of
> VSM
> > would
> > also be a probabilistic model.
> >
> > >>
> > it's got strange normalizations though that doesn't allow comparisons
> of
> > rank values across queries.
> > >>
> >
> > Lucene's internal ranking sometimes returns values > 1.0, these are
> then
> > normalized to 1.0,
> > adjusting other rankings accordingly. While I have nothing to say
> against
> > this - it's a hack,
> > but useful - it makes comparing the rank values across queries really
> > difficult. It's like
> > using different scales whenever you measure something different, and
> then
> > you do not tell
> > anyone about it.
> >
> > >>
> > it isn't terribly hard to make a normalized probabilistic model that
> > allows comparing of document scores across queries and assign a
> meaning to
> > the score. i've done it.
> > >>
> >
> > Stop bragging, send us your Similarity implementation :)
> >
> > Regards,
> >
> > Karsten
> >
> >
> > -----Ursprüngliche Nachricht-----
> > Von: Chong, Herb [mailto:HChong3@bloomberg.com]
> > Gesendet: Mittwoch, 3. Dezember 2003 23:01
> > An: Lucene Users List
> > Betreff: RE: Probabilistic Model in Lucene - possible?
> >
> >
> > i think i am missing the original question, but by most accepted
> > definitions, the tf/idf model in Lucene is a probabilistic model. it's
> got
> > strange normalizations though that doesn't allow comparisons of rank
> > values across queries.
> >
> > it isn't terribly hard to make a normalized probabilistic model that
> > allows comparing of document scores across queries and assign a
> meaning to
> > the score. i've done it. however, that means abandoning idf and
> keeping
> > actual term frequencies for each document and document size. once you
> > normalize this way, you can intermingle document scores from different
> > queries and different corpora and make statements about the absolute
> value
> > of the score. it also leads directly into the discussion we had
> earlier
> > about interterm correlations and how to handle them properly since the
> > full interterm probabilistic model has as a special case the
> traditional
> > tf/idf model. interjecting Boolean conditions and boost makes the
> model
> > much more complicated.
> >
> > Herb....
> >
> > -----Original Message-----
> > From: Karsten Konrad [mailto:Karsten.Konrad@xtramind.com]
> > Sent: Wednesday, December 03, 2003 4:51 PM
> > To: Lucene Users List
> > Subject: AW: Probabilistic Model in Lucene - possible?
> >
> > >>
> > I would highly appreciate it if the experts here (especially Karsten
> or
> > Chong) look at my idea and tell me if this would be possible.
> > >>
> >
> > Sorry, I have no idea about how to use a probabilistic approach with
> > Lucene, but if anyone does so, I would like to know, too.
> >
> > I am currently puzzled by a related question: I would like to know if
> > there are any approaches to get a confidence value for relevance
> > rather than a ranking. I.e., it would be nice to have a ranking
> > weight whose value has some kind of semantics such that we could
> > compare results from different queries. Can probabilistic approches
> > do anything like this?
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
--
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org