lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ambiese...@gmx.de
Subject RE: Probabilistic Model in Lucene - possible?
Date Fri, 05 Dec 2003 20:37:57 GMT
I guess you mean "Modern Information Retrieval" ... I would be a little bit
careful since this book has theoretical glasses on. It might look more
difficult than expected. However I would like to discuss this further. How could it
be archived to get the values your are writing about? Any first ideas?

Cheers,
Ralf

> Deal all,
> 
> I am interested in implement a probabilistic model in Lucene as well.
> I checked the book titled "model information retrieval" authored by
> Ricardo 
> Baeza-Yates and Berthier Ribeiro-Neto, it seems to me that the 
> implementation is not very complicated when we use Lucene's IndexReader 
> class, almost all the parameters needed are there: the total number of 
> document in the index (collection), the number of documents having a 
> particular term, that is it. Probably we need to find out a satisfied
> method
> of defining the weights of terms in the documents as well as in the query.
> 
> Cheers,
> 
> Shengli
> 
> 
> 
> Adam Saltiel <adam.saltiel@btinternet.com> said:
> 
> > Herb,
> > Any one game ... ?
> > No takers? I would be very interested, but maybe beyond what can be
> > posted in a mail list. I'd be equally interested in any references you
> > may have.
> > As we are on this subject how does LSI and the similar CNG (context
> > network graph) fit into the model used by lucene. Could lucene be
> > massaged to implement different mathematical models of search and
> > retrieval, if so how modular are the core functions?
> > 
> > Adam Saltiel
> > 
> > 
> > > -----Original Message-----
> > > From: Chong, Herb [mailto:HChong3@bloomberg.com]
> > > Sent: Thursday, December 04, 2003 1:53 PM
> > > To: Lucene Users List
> > > Subject: RE: Probabilistic Model in Lucene - possible?
> > >
> > > not all tf/idf variants are probabilistic models, but a great many are
> > if
> > > the term weights are probabilities. if we just take straight,
> > unmodified
> > > Term Frequency in a document, Inverse Document Frequency in the
> > corpus,
> > > and the Term Frequency in the query as 1, you are in fact comparing
> > the
> > > statistical properties of the query against the statistical properties
> > of
> > > the query. they are probabilities you are comparing. i can't think of
> > many
> > > papers that come right out and say it, but if you look at an
> > individual
> > > term weight and can interpret it as a genuine probability, the vector
> > > space model based on the weights is a probabilistic model. the
> > derivation
> > > is relatively straight forward to show it, if you have the right
> > general
> > > model to start with. once you start throwing in ad hoc normalizations,
> > > then things get out of whack and it's not longer a probabilistic
> > model.
> > >
> > > the implementations that i have done are with a former company and
> > that
> > > means secret and protected by various intellectual property rights.
> > > however, i can sketch here the general approach one has to take and an
> > > outline of the derivation that unifies probabilistic models with
> > vector
> > > space models and at the same time incorporate pairwise interterm
> > > correlation. in fact, the pairwise interterm correlations are a
> > > fundamental assumption. once you do all this, you can show that the
> > > traditional vector space model is a special case of a pairwise
> > interterm
> > > correlation model. for those that are interested in advanced matrix
> > > algebra and some basic statistics, it should be very interesting. if
> > only
> > > i had a published paper, i would post it. unfortunately, what i have
> > is
> > > very obtuse because it's protected. the only paper that started out
> > was
> > > submitted to SIGIR but rejected by all but one referee. that one
> > thought
> > > this was a tremendous unification of the two methods, but academic
> > > journals being what they are, when 4 out of 5 referees can't
> > understand
> > > the paper, it doesn't get published. i may brush it off and enlarge
> > into a
> > > much longer paper for the Journal of IR, but once again, unless you
> > are
> > > comfortable with probability theory and matrix theory, you are not
> > going
> > > to follow it.
> > >
> > > so, who is game for a tutorial on the derivation?
> > >
> > > Herb...
> > >
> > > -----Original Message-----
> > > From: Karsten Konrad [mailto:Karsten.Konrad@xtramind.com]
> > > Sent: Thursday, December 04, 2003 5:09 AM
> > > To: Lucene Users List
> > > Subject: AW: Probabilistic Model in Lucene - possible?
> > >
> > >
> > >
> > > Hi Herb,
> > >
> > > thank you for your insights.
> > >
> > > >>
> > > but by most accepted definitions, the tf/idf model in Lucene is a
> > > probabilistic model.
> > > >>
> > >
> > > Can you send some pointers to help me understand that? Are all TF/IDF-
> > > variants
> > > probabilistic models? If so, what makes any model a non-probabilistic
> > one?
> > > If you claim that TF/IDF is probabilistic, then the plain cosine (an
> > > extreme
> > > form of TF/IDF, with IDF for all terms being considered constant) of
> > VSM
> > > would
> > > also be a probabilistic model.
> > >
> > > >>
> > > it's got strange normalizations though that doesn't allow comparisons
> > of
> > > rank values across queries.
> > > >>
> > >
> > > Lucene's internal ranking sometimes returns values > 1.0, these are
> > then
> > > normalized to 1.0,
> > > adjusting other rankings accordingly. While I have nothing to say
> > against
> > > this - it's a hack,
> > > but useful - it makes comparing the rank values across queries really
> > > difficult. It's like
> > > using different scales whenever you measure something different, and
> > then
> > > you do not tell
> > > anyone about it.
> > >
> > > >>
> > > it isn't terribly hard to make a normalized probabilistic model that
> > > allows comparing of document scores across queries and assign a
> > meaning to
> > > the score. i've done it.
> > > >>
> > >
> > > Stop bragging, send us your Similarity implementation :)
> > >
> > > Regards,
> > >
> > > Karsten
> > >
> > >
> > > -----Ursprüngliche Nachricht-----
> > > Von: Chong, Herb [mailto:HChong3@bloomberg.com]
> > > Gesendet: Mittwoch, 3. Dezember 2003 23:01
> > > An: Lucene Users List
> > > Betreff: RE: Probabilistic Model in Lucene - possible?
> > >
> > >
> > > i think i am missing the original question, but by most accepted
> > > definitions, the tf/idf model in Lucene is a probabilistic model. it's
> > got
> > > strange normalizations though that doesn't allow comparisons of rank
> > > values across queries.
> > >
> > > it isn't terribly hard to make a normalized probabilistic model that
> > > allows comparing of document scores across queries and assign a
> > meaning to
> > > the score. i've done it. however, that means abandoning idf and
> > keeping
> > > actual term frequencies for each document and document size. once you
> > > normalize this way, you can intermingle document scores from different
> > > queries and different corpora and make statements about the absolute
> > value
> > > of the score. it also leads directly into the discussion we had
> > earlier
> > > about interterm correlations and how to handle them properly since the
> > > full interterm probabilistic model has as a special case the
> > traditional
> > > tf/idf model. interjecting Boolean conditions and boost makes the
> > model
> > > much more complicated.
> > >
> > > Herb....
> > >
> > > -----Original Message-----
> > > From: Karsten Konrad [mailto:Karsten.Konrad@xtramind.com]
> > > Sent: Wednesday, December 03, 2003 4:51 PM
> > > To: Lucene Users List
> > > Subject: AW: Probabilistic Model in Lucene - possible?
> > >
> > > >>
> > > I would highly appreciate it if the experts here (especially Karsten
> > or
> > > Chong) look at my idea and tell me if this would be possible.
> > > >>
> > >
> > > Sorry, I have no idea about how to use a probabilistic approach with
> > > Lucene, but if anyone does so, I would like to know, too.
> > >
> > > I am currently puzzled by a related question: I would like to know if
> > > there are any approaches to get a confidence value for relevance
> > > rather than a ranking. I.e., it would be nice to have a ranking
> > > weight whose value has some kind of semantics such that we could
> > > compare results from different queries. Can probabilistic approches
> > > do anything like this?
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> > 
> > 
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> > 
> 
> 
> 
> -- 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Neu: Preissenkung für MMS und FreeMMS! http://www.gmx.net



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message