lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: BM25 Scoring Patch
Date Wed, 17 Feb 2010 15:31:19 GMT
Yuval, I apologize for not having an intelligent response to your question
(if I did, I would try to formulate it as a patch), but I too would like for
this to be extremely easy... maybe we can iterate on the patch.

Below is how I feel about it:

I guess that, theoretically, the use of Similarity is how we would implement
a pluggable scoring formula; I think this is already supported by Solr. It
would be nice if BM25 could be just another Similarity, but I'm not even
sure that's realistic in the near future.
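
To make concrete what "just another Similarity" would mean: today the
pluggable surface is roughly the handful of per-term factors below (a
minimal sketch against the 2.9/3.0 Similarity API; the override values are
only illustrative, not a recommendation):

import org.apache.lucene.search.DefaultSimilarity;

public class SketchSimilarity extends DefaultSimilarity {

  // dampen raw term frequency differently than the default sqrt(freq)
  @Override
  public float tf(float freq) {
    return (float) Math.log(1 + freq);
  }

  // the only per-document length signal stored at index time is this
  // byte-encoded norm, which is exactly why stats BM25 needs (like the
  // exact doc length) are not available through this interface
  @Override
  public float lengthNorm(String fieldName, int numTokens) {
    return (float) (1.0 / Math.sqrt(numTokens));
  }
}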

Yet if we don't do the hard work up front to make it easy to plug in things
like BM25, then no one will implement additional scoring formulas for
Lucene; we currently make this terribly difficult.

In the BM25 case we are just lucky, as Joaquin went through a lot of work
and jumped through a lot of hoops to make it happen.

On Wed, Feb 17, 2010 at 3:36 AM, Yuval Feinstein <yuvalf@answers.com> wrote:

> This is very interesting and much friendlier than a flame war.
> My practical question for Robert is:
> How can we modify the BM25 patch so that it:
> a) Becomes part of Lucene contrib.
> b) Becomes easier to use (preventing mistakes such as Ivan's setting of the
> BM25 similarity during indexing).
> c) Proceeds towards a pluggable scoring formula? (Ideally, we would have an
> IndexReader/IndexSearcher/IndexWriter constructor that lets us specify a
> scoring model through an enum, with the default being, well, Lucene's
> default scoring model; see the sketch below.)
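>
> To make (c) concrete (purely hypothetical; no such enum or constructor
> exists in Lucene today):
>
>   public enum ScoringModel { DEFAULT, BM25 }
>
>   // the searcher would pick and install the matching Similarity itself
>   IndexSearcher searcher = new IndexSearcher(dir, true, ScoringModel.BM25);
>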
> The easier it is to use, the more experiments people can run to see how it
> works for them.
> A future "marketing" step could be adding BM25 to Solr, to further ease
> experimentation.
> TIA,
> Yuval
>
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Tuesday, February 16, 2010 10:38 PM
> To: java-user@lucene.apache.org
> Subject: Re: BM25 Scoring Patch
>
> Joaquin, I have a typical methodology where I don't optimize any scoring
> params, be they BM25 params (I stick with your defaults) or lnb.ltc params
> (I stick with the default slope). When doing query expansion I don't modify
> the defaults for MoreLikeThis either.
>
> I've found that changing these params can make a significant difference in
> retrieval performance, which is interesting, but I'm typically focused on
> text analysis (how the text is indexed: stemming, stopwords). I also feel
> that such things are corpus-specific, which I generally try to avoid in my
> work...
>
> For example, in analysis work, the text collection often has a majority of
> text in a specific tense (e.g. news), so I don't try to tune any part of
> analysis at all, as I worry this would be corpus-specific... I do the same
> with scoring.
>
> As far as why some models perform better than others for certain languages,
> I think this is a million-dollar question. But my intuition (I don't have
> references or anything to back this up) is that probabilistic models
> outperform vector-space models when you are using approaches like n-grams:
> you don't have nice stopword lists, stemming, decompounding, etc.
>
> This is particularly interesting to me, as probabilistic model + n-gram is
> a very general multilingual approach that I would like to have working well
> in Lucene; it's also important as a "default" when we don't have a nicely
> tuned analyzer available that works well with a vector-space model. In my
> opinion, vector space tends to fall apart without good language support.
>
>
> On Tue, Feb 16, 2010 at 3:23 PM, JOAQUIN PEREZ IGLESIAS <
> joaquin.perez@lsi.uned.es> wrote:
>
> > Ok,
> >
> > I'm not advocating for the BM25 patch either; unfortunately, BM25 was not
> > my idea :-))), and I'm sure the implementation can be improved.
> >
> > When you use the BM25 implementation, are you optimising the parameters
> > specifically per collection? (This is a key factor for improving BM25
> > performance.)
> >
> > Why do you think BM25 works better for English than for other languages
> > (apart from experiments)? What are your intuitions?
> >
> > I don't have much experience with languages other than Spanish and
> > English, and this sounds pretty interesting.
> >
> > Kind Regards.
> >
> > P.S.: Maybe this is not a topic for this list?
> >
> >
> > > Joaquin, I don't see this as a flame war? First of all, I'd like to
> > > personally thank you for your excellent BM25 implementation!
> > >
> > > I think the selection of a retrieval model depends highly on the
> > > language/indexing approach; i.e., if we were talking about East Asian
> > > languages, I think we want a probabilistic model: no argument there!
> > >
> > > All I said was that it is a myth that BM25 is "always" better than
> > > Lucene's scoring model; it really depends on what you are trying to do,
> > > how you are indexing your text, the properties of your corpus, and how
> > > your queries are run.
> > >
> > > I don't want to come across as advocating the lnb.ltc approach either;
> > > sure, I wrote the patch, but this means nothing. I only like it because
> > > it is currently a simple integration into Lucene, but long-term it's
> > > best if we can support other models too!
> > >
> > > Finally, I think there is something to be said for Lucene's default
> > > retrieval model, which in my (non-English) findings isn't terrible at
> > > all across the board... then again, I am working with languages where
> > > analysis is really the thing holding Lucene back, not scoring.
> > >
> > > On Tue, Feb 16, 2010 at 2:40 PM, JOAQUIN PEREZ IGLESIAS <
> > > joaquin.perez@lsi.uned.es> wrote:
> > >
> > >> Just some final comments (as I said, I'm not interested in flame wars).
> > >>
> > >> If I obtain better results, there is no problem with pooling; otherwise
> > >> it is biased.
> > >> The only important thing (in my opinion) is that it cannot be said that
> > >> BM25 is a myth.
> > >> Yes, you are right that there is no single ranking model that beats the
> > >> rest, but there are models that generally show better performance in
> > >> more cases.
> > >>
> > >> About CLEF, I have had the same experience (VSM vs. BM25) on Spanish
> > >> and English (WebCLEF) and on Q&A (ResPubliQA).
> > >>
> > >> Ivan, check the parameters (b and k1); you can probably improve your
> > >> results. (That's the bad part of BM25.)
> > >>
> > >> Finally, we are just speaking from personal experience, so obviously
> > >> you should use the best model for your data and your own experience;
> > >> in IR there are no myths and no best ranking models. If any of us is
> > >> able to find the "best" ranking model, or is able to prove that any
> > >> state-of-the-art model is a myth, he should send those results to the
> > >> SIGIR conference.
> > >>
> > >> Ivan, Robert, good luck with your experiments; as I said, the good part
> > >> of IR is that you can always run your own experiments.
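> > >>
> > >> (For instance, using the patch's BM25Parameters setters, something like
> > >> BM25Parameters.setB(0.75f) and BM25Parameters.setK1(1.2f), the common
> > >> literature defaults, would be a reasonable starting point before
> > >> sweeping values per collection.)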
> > >>
> > >> > I don't think it's really a competition; I think, preferably, we
> > >> > should have the flexibility to change the scoring model in Lucene,
> > >> > actually?
> > >> >
> > >> > I have found lots of cases where VSM improves on BM25, but then again
> > >> > I don't work with TREC stuff, as I work with non-English collections.
> > >> >
> > >> > It doesn't contradict years of research to say that VSM is a
> > >> > state-of-the-art model: besides the TREC-4 results, there are CLEF
> > >> > results where VSM models perform competitively with, or exceed,
> > >> > BM25/DFR/etc. (Finnish, Russian, etc.).
> > >> >
> > >> > It depends on the collection; there isn't a 'best retrieval formula'.
> > >> >
> > >> > Note: I have no bias against BM25, but it's definitely a myth to say
> > >> > there is a single retrieval formula that is the 'best' across the
> > >> > board.
> > >> >
> > >> >
> > >> > On Tue, Feb 16, 2010 at 1:53 PM, JOAQUIN PEREZ IGLESIAS <
> > >> > joaquin.perez@lsi.uned.es> wrote:
> > >> >
> > >> >> By the way,
> > >> >>
> > >> >> I don't want to start a VSM vs. BM25 flame war, but I really believe
> > >> >> I have to express my opinion, as Robert has done. In my experience,
> > >> >> I have never found a case where VSM significantly improves on BM25.
> > >> >> Maybe you can find some cases under very specific collection
> > >> >> characteristics (such as an average length of 300 vs. 3000), or with
> > >> >> bad usage of BM25 (improper parameters), where it can happen.
> > >> >>
> > >> >> BM25 is not just a different way of doing length normalization; it
> > >> >> is grounded strongly in the probabilistic framework, and it
> > >> >> parametrises both frequencies and length. It is probably the most
> > >> >> successful ranking model of recent years in Information Retrieval.
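> > >> >>
> > >> >> For reference, the standard Okapi BM25 term weight (the usual
> > >> >> textbook form; b and k1 are the parameters the patch exposes) is:
> > >> >>
> > >> >>   score(t,d) = IDF(t) * tf * (k1 + 1)
> > >> >>                / (tf + k1 * (1 - b + b * dl/avgdl))
> > >> >>
> > >> >> where tf is the term frequency in d, dl the document length, and
> > >> >> avgdl the collection's average document length.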
> > >> >>
> > >> >> I have never read a paper where VSM improves on any of the
> > >> >> state-of-the-art ranking models (Language Models, DFR, BM25, ...),
> > >> >> although VSM with pivoted length normalisation can obtain nice
> > >> >> results. This can be verified by checking the last years of the
> > >> >> TREC competition.
> > >> >>
> > >> >> Honestly, to say that it is a myth that BM25 improves on VSM breaks
> > >> >> with the last 10 or 15 years of research on Information Retrieval,
> > >> >> and I really believe that is not accurate.
> > >> >>
> > >> >> The good thing about Information Retrieval is that you can always
> > >> >> run your own experiments and draw on the experience of many years
> > >> >> of research.
> > >> >>
> > >> >> PS: This opinion is based on experiments on TREC and CLEF
> > >> >> collections; obviously we can start a debate about the suitability
> > >> >> of this type of experimentation (the concept of relevance, pooling,
> > >> >> relevance judgements), but that is a much more complex topic and I
> > >> >> believe it is far from what we are dealing with here.
> > >> >>
> > >> >> PS2: In relation to TREC-4, Cornell used pivoted length
> > >> >> normalisation and applied pseudo-relevance feedback, which honestly
> > >> >> makes the analysis of the results much more difficult. Obviously,
> > >> >> their results were part of the pool.
> > >> >>
> > >> >> Sorry for the huge mail :-))))
> > >> >>
> > >> >> > Hi Ivan,
> > >> >> >
> > >> >> > The problem is that, unfortunately, BM25 cannot be implemented by
> > >> >> > overriding the Similarity interface. Therefore BM25Similarity only
> > >> >> > computes the classic probabilistic IDF (which matters only at
> > >> >> > search time). If you set BM25Similarity at indexing time, some
> > >> >> > basic stats (like document lengths) are not stored correctly in
> > >> >> > the segments.
> > >> >> >
> > >> >> > When you use BM25BooleanQuery, this class will set the
> > >> >> > BM25Similarity for you automatically, so you don't need to do it
> > >> >> > explicitly.
> > >> >> >
> > >> >> > I tried to write this implementation with a focus on not
> > >> >> > interfering with the typical use of Lucene (so no changing
> > >> >> > DefaultSimilarity).
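> > >> >> >
> > >> >> > In other words, the intended usage is just this (a minimal
> > >> >> > sketch; the field name and query text are placeholders):
> > >> >> >
> > >> >> >   // index with the stock DefaultSimilarity -- no
> > >> >> >   // writer.setSimilarity(new BM25Similarity()) anywhere
> > >> >> >   IndexWriter writer = new IndexWriter(dir,
> > >> >> >       new StandardAnalyzer(Version.LUCENE_CURRENT), true,
> > >> >> >       IndexWriter.MaxFieldLength.UNLIMITED);
> > >> >> >
> > >> >> >   // at search time BM25BooleanQuery installs BM25Similarity
> > >> >> >   // itself, so the searcher needs no setSimilarity() call
> > >> >> >   // (BM25Parameters, e.g. the average length, are still set
> > >> >> >   // as in your parser)
> > >> >> >   Query query = new BM25BooleanQuery("some query", "TEXT",
> > >> >> >       new StandardAnalyzer(Version.LUCENE_CURRENT));
> > >> >> >   Searcher searcher = new IndexSearcher(dir, true);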
> > >> >> >
> > >> >> >> Joaquin, Robert,
> > >> >> >>
> > >> >> >> I followed Joaquin's recommendation and removed the calls that set
> > >> >> >> the similarity to BM25 explicitly (indexer, searcher). The results
> > >> >> >> showed a 55% improvement in the MAP score (0.141 -> 0.219) over the
> > >> >> >> default similarity.
> > >> >> >>
> > >> >> >> Joaquin, how would setting the similarity to BM25 explicitly make
> > >> >> >> the score worse?
> > >> >> >>
> > >> >> >> Thank you,
> > >> >> >>
> > >> >> >> Ivan
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> --- On Tue, 2/16/10, Robert Muir <rcmuir@gmail.com> wrote:
> > >> >> >>
> > >> >> >>> From: Robert Muir <rcmuir@gmail.com>
> > >> >> >>> Subject: Re: BM25 Scoring Patch
> > >> >> >>> To: java-user@lucene.apache.org
> > >> >> >>> Date: Tuesday, February 16, 2010, 11:36 AM
> > >> >> >>> Yes, Ivan: if possible, please report back any findings you can
> > >> >> >>> from the experiments you are doing!
> > >> >> >>>
> > >> >> >>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> > >> >> >>> <joaquin.perez@lsi.uned.es> wrote:
> > >> >> >>>
> > >> >> >>> > Hi Ivan,
> > >> >> >>> >
> > >> >> >>> > You shouldn't set the BM25Similarity for indexing or searching.
> > >> >> >>> > Please try removing the lines:
> > >> >> >>> >
> > >> >> >>> >   writer.setSimilarity(new BM25Similarity());
> > >> >> >>> >   searcher.setSimilarity(sim);
> > >> >> >>> >
> > >> >> >>> > Please let us/me know if your results improve with these
> > >> >> >>> > changes.
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>> > Robert Muir wrote:
> > >> >> >>> >
> > >> >> >>> >> Hi Ivan, I've seen many cases where BM25 performs worse than
> > >> >> >>> >> Lucene's default Similarity. Perhaps this is just another one?
> > >> >> >>> >>
> > >> >> >>> >> Again, while I have not worked with this particular collection,
> > >> >> >>> >> I looked at the statistics and noted that it is composed of
> > >> >> >>> >> several 'sub-collections': for example, the PAT documents on
> > >> >> >>> >> disk 3 have an average doc length of 3543, but the AP documents
> > >> >> >>> >> on disk 1 have an average doc length of 353.
> > >> >> >>> >>
> > >> >> >>> >> I have found on other collections that any advantages of BM25's
> > >> >> >>> >> document length normalization fall apart when 'average document
> > >> >> >>> >> length' doesn't make a whole lot of sense (cases like this).
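> > >> >> >>> >>
> > >> >> >>> >> To put numbers on it (avgdl of 798 taken from Ivan's setup;
> > >> >> >>> >> b = 0.75 chosen purely for illustration): BM25's length factor
> > >> >> >>> >> (1 - b + b*dl/avgdl) comes out to about 0.58 for an AP doc of
> > >> >> >>> >> length 353 but about 3.58 for a PAT doc of length 3543, so
> > >> >> >>> >> term frequencies are damped roughly 6x harder in one
> > >> >> >>> >> sub-collection than the other, regardless of relevance.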
> > >> >> >>> >>
> > >> >> >>> >> For this same reason, I've only found a few collections where
> > >> >> >>> >> BM25's doc length normalization is really significantly better
> > >> >> >>> >> than Lucene's.
> > >> >> >>> >>
> > >> >> >>> >> In my opinion, the results on a particular test collection or
> > >> >> >>> >> two have perhaps been taken too far and created a myth that
> > >> >> >>> >> BM25 is always superior to Lucene's scoring... this is not
> > >> >> >>> >> true!
> > >> >> >>> >>
> > >> >> >>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
> > >> >> >>> >> <iprovalo@yahoo.com> wrote:
> > >> >> >>> >>
> > >> >> >>> >>> I applied the Lucene patch mentioned in
> > >> >> >>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and ran the
> > >> >> >>> >>> MAP numbers on the TREC-3 collection using topics 151-200. I
> > >> >> >>> >>> am not getting better results compared to Lucene
> > >> >> >>> >>> DefaultSimilarity. I suspect I am not using it correctly. I
> > >> >> >>> >>> have single-field documents. This is the process I use:
> > >> >> >>> >>>
> > >> >> >>> >>> 1. During indexing, I set the similarity to BM25 as such:
> > >> >> >>> >>>
> > >> >> >>> >>> IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
> > >> >> >>> >>>     Version.LUCENE_CURRENT), true,
> > >> >> >>> >>>     IndexWriter.MaxFieldLength.UNLIMITED);
> > >> >> >>> >>> writer.setSimilarity(new BM25Similarity());
> > >> >> >>> >>>
> > >> >> >>> >>> 2. During the precision/recall measurements, I am using a
> > >> >> >>> >>> SimpleBM25QQParser extension I added to the benchmark:
> > >> >> >>> >>>
> > >> >> >>> >>> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
> > >> >> >>> >>>
> > >> >> >>> >>> 3. Here is the parser code (I set an average doc length here):
> > >> >> >>> >>>
> > >> >> >>> >>> public Query parse(QualityQuery qq) throws ParseException {
> > >> >> >>> >>>   BM25Parameters.setAverageLength(indexField, 798.30f); // avg doc length
> > >> >> >>> >>>   BM25Parameters.setB(0.5f); // tried default values
> > >> >> >>> >>>   BM25Parameters.setK1(2f);
> > >> >> >>> >>>   return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
> > >> >> >>> >>>       new StandardAnalyzer(Version.LUCENE_CURRENT));
> > >> >> >>> >>> }
> > >> >> >>> >>>
> > >> >> >>> >>> 4. The searcher is using BM25 similarity:
> > >> >> >>> >>>
> > >> >> >>> >>> Searcher searcher = new IndexSearcher(dir, true);
> > >> >> >>> >>> searcher.setSimilarity(sim);
> > >> >> >>> >>>
> > >> >> >>> >>> Am I missing some steps? Does anyone have experience with
> > >> >> >>> >>> this code?
> > >> >> >>> >>>
> > >> >> >>> >>> Thanks,
> > >> >> >>> >>>
> > >> >> >>> >>> Ivan
> > >> >> >>> >>>
> > >> >> >>> >>
> > >> >> >>> >>
> > >> >> >>> > --
> > >> >> >>> > -----------------------------------------------------------
> > >> >> >>> > Joaquín Pérez Iglesias
> > >> >> >>> > Dpto. Lenguajes y Sistemas Informáticos
> > >> >> >>> > E.T.S.I. Informática (UNED)
> > >> >> >>> > Ciudad Universitaria
> > >> >> >>> > C/ Juan del Rosal nº 16
> > >> >> >>> > 28040 Madrid - Spain
> > >> >> >>> > Phone: +34 91 398 89 19
> > >> >> >>> > Fax:   +34 91 398 65 35
> > >> >> >>> > Office: 2.11
> > >> >> >>> > Email: joaquin.perez@lsi.uned.es
> > >> >> >>> > web:   http://nlp.uned.es/~jperezi/
> > >> >> >>> > -----------------------------------------------------------
> > >> >> >>> >
> > >> >> >>>
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>>
> > >> >> >>>
> > >> >> >>> --
> > >> >> >>> Robert Muir
> > >> >> >>> rcmuir@gmail.com
> > >> >> >>>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >
> > >> >
> > >> > --
> > >> > Robert Muir
> > >> > rcmuir@gmail.com
> > >> >
> > >>
> > >>
> > >>
> > >>
> > >>
> > >
> > >
> > > --
> > > Robert Muir
> > > rcmuir@gmail.com
> > >
> >
> >
> >
> >
> >
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>



-- 
Robert Muir
rcmuir@gmail.com
