lucene-java-user mailing list archives

From Yuval Feinstein <yuv...@answers.com>
Subject RE: BM25 Scoring Patch
Date Wed, 17 Feb 2010 08:36:19 GMT
This is very interesting and much friendlier than a flame war.
My practical question for Robert is: how can we modify the BM25 patch so that it:
a) becomes part of Lucene contrib;
b) is easier to use (preventing mistakes such as Ivan's setting the BM25 similarity during indexing);
c) moves towards a pluggable scoring formula? (Ideally, we would have IndexReader/IndexSearcher/IndexWriter constructors that let you specify a scoring model through an enum, with the default being, well, Lucene's default scoring model.)
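
For instance (a purely hypothetical sketch; no such enum or constructors exist in Lucene today):

    public enum ScoringModel { DEFAULT, BM25 }

    // hypothetical constructors; the default stays Lucene's scoring model
    IndexWriter writer = new IndexWriter(dir, analyzer, true,
        IndexWriter.MaxFieldLength.UNLIMITED, ScoringModel.BM25);
    IndexSearcher searcher = new IndexSearcher(dir, true, ScoringModel.BM25);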
The easier it is to use, the more experiments people can run and see how it works for them.
A future "marketing" step could be adding BM25 to Solr, to further ease experimentation.
TIA,
Yuval


-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com]
Sent: Tuesday, February 16, 2010 10:38 PM
To: java-user@lucene.apache.org
Subject: Re: BM25 Scoring Patch

Joaquin, my typical methodology is not to optimize any scoring params,
be they BM25 params (I stick with your defaults) or lnb.ltc params (I
stick with the default slope). When doing query expansion I don't modify
the defaults for MoreLikeThis either.

I've found that changing these params can make a significant difference in
retrieval performance, which is interesting, but I'm typically focused on
text analysis (how is the text indexed? stemming? stopwords?). I also feel
that such tuning is corpus-specific, which I generally try to avoid in my
work...

For example, in analysis work the text collection often has a majority of
text in a specific tense (e.g. news), so I don't try to tune any part of
analysis at all, as I worry this would be corpus-specific... I do the same
with scoring.

As for why some models perform better than others for certain languages,
I think this is a million-dollar question. But my intuition (I don't have
references or anything to back this up) is that probabilistic models
outperform vector-space models when you are using approaches like n-grams:
you don't have nice stopword lists, stemming, decompounding, etc.

This is particularly interesting to me, as probabilistic model + n-grams is
a very general multilingual approach that I would like to have working well
in Lucene. It's also important as a "default" for when we don't have a
nicely tuned analyzer available that will work well with a vector-space
model. In my opinion, vector-space tends to fall apart without good
language support.
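
For concreteness, a minimal sketch of the kind of n-gram setup I mean,
using the contrib NGramTokenizer (the trigram size is an arbitrary choice):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;

    // Language-independent analysis: raw character n-grams, so no
    // stemming, stopword list, or decompounding is required.
    public class NGramAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        return new NGramTokenizer(reader, 3, 3); // character trigrams
      }
    }

Index and query through the same analyzer; no language-specific resources
are needed.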


On Tue, Feb 16, 2010 at 3:23 PM, JOAQUIN PEREZ IGLESIAS <joaquin.perez@lsi.uned.es> wrote:

> Ok,
>
> I'm not advocating the BM25 patch either; unfortunately, BM25 was not my
> idea :-))), and I'm sure that the implementation can be improved.
>
> When you use the BM25 implementation, are you optimising the parameters
> specifically per collection? (That is a key factor for improving BM25
> performance.)
>
> Why do you think that BM25 works better for English than for other
> languages (apart from the experiments)? What are your intuitions?
>
> I don't have much experience with languages beyond Spanish and English,
> so this sounds pretty interesting.
>
> Kind regards.
>
> P.S.: Maybe this is not a topic for this list???
>
>
> > Joaquin, I don't see this as a flame war! First of all, I'd like to
> > personally thank you for your excellent BM25 implementation!
> >
> > I think the selection of a retrieval model depends highly on the
> > language/indexing approach; if we were talking about East Asian
> > languages, I think we would want a probabilistic model: no argument
> > there!
> >
> > All I said was that it is a myth that BM25 is "always" better than
> > Lucene's scoring model; it really depends on what you are trying to do,
> > how you are indexing your text, the properties of your corpus, and how
> > your queries are run.
> >
> > I don't even want to come across as advocating the lnb.ltc approach
> > either; sure, I wrote the patch, but that means nothing. I only like it
> > because it is currently a simple integration into Lucene, but long-term
> > it's best if we can support other models as well!
> >
> > Finally, I think there is something to be said for Lucene's default
> > retrieval model, which in my (non-English) findings isn't terrible at
> > all across the board... then again, I am working with languages where
> > analysis is really the thing holding Lucene back, not scoring.
> >
> > On Tue, Feb 16, 2010 at 2:40 PM, JOAQUIN PEREZ IGLESIAS <joaquin.perez@lsi.uned.es> wrote:
> >
> >> Just some final comments (as I said, I'm not interested in flame wars).
> >>
> >> If I obtain better results, there is no problem with pooling; otherwise,
> >> it is biased.
> >> The only important thing (in my opinion) is that it cannot be said that
> >> BM25 is a myth.
> >> Yes, you are right that there is no single ranking model that beats the
> >> rest, but there are models that generally show better performance in
> >> more cases.
> >>
> >> About CLEF, I have had the same experience (VSM vs. BM25) on Spanish and
> >> English (WebCLEF) and Q&A (ResPubliQA).
> >>
> >> Ivan, check the parameters (b and k1); you can probably improve your
> >> results. (That's the bad part of BM25.)
> >>
> >> Finally, we are just speaking from personal experience, so obviously you
> >> should use the best model for your data and your own experience; in IR
> >> there are neither myths nor best ranking models. If any of us were able
> >> to find the "best" ranking model, or to prove that any state-of-the-art
> >> model is a myth, he should send those results to the SIGIR conference.
> >>
> >> Ivan, Robert, good luck with your experiments; as I said, the good part
> >> of IR is that you can always run experiments on your own.
> >>
> >> > I don't think it's really a competition; ideally, we should have the
> >> > flexibility to change the scoring model in Lucene.
> >> >
> >> > I have found lots of cases where VSM improves on BM25, but then again
> >> > I don't work with TREC stuff, as I work with non-English collections.
> >> >
> >> > It doesn't contradict years of research to say that VSM is competitive
> >> > with state-of-the-art models: besides the TREC-4 results, there are
> >> > CLEF results (Finnish, Russian, etc.) where VSM models perform
> >> > competitively with, or exceed, BM25/DFR/etc.
> >> >
> >> > It depends on the collection; there isn't a 'best retrieval formula'.
> >> >
> >> > Note: I have no bias against BM25, but it's definitely a myth to say
> >> > there is a single retrieval formula that is the 'best' across the
> >> > board.
> >> >
> >> >
> >> > On Tue, Feb 16, 2010 at 1:53 PM, JOAQUIN PEREZ IGLESIAS <joaquin.perez@lsi.uned.es> wrote:
> >> >
> >> >> By the way,
> >> >>
> >> >> I don't want to start a VSM vs. BM25 flame war, but I really believe
> >> >> that I have to express my opinion, as Robert has done. In my
> >> >> experience, I have never found a case where VSM significantly improves
> >> >> on BM25. Maybe you can find some cases with very specific collection
> >> >> characteristics (such as an average length of 300 vs. 3000) or with
> >> >> bad usage of BM25 (improper parameters) where it can happen.
> >> >>
> >> >> BM25 is not just a different way of doing length normalization; it is
> >> >> strongly based on the probabilistic framework, and it parametrises
> >> >> term frequencies and document length. It is probably the most
> >> >> successful ranking model of recent years in Information Retrieval.
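> >> >>
> >> >> (For reference, the standard BM25 scoring function, with k1 and b as
> >> >> the free parameters:
> >> >>
> >> >>   score(D,Q) = sum over terms t in Q of
> >> >>     IDF(t) * tf(t,D) * (k1 + 1)
> >> >>              / (tf(t,D) + k1 * (1 - b + b * |D| / avgdl))
> >> >>
> >> >> where |D| is the length of D and avgdl is the average document length
> >> >> in the collection.)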
> >> >>
> >> >> I have never read a paper where VSM improves on any of the
> >> >> state-of-the-art ranking models (Language Models, DFR, BM25, ...),
> >> >> although VSM with pivoted length normalisation can obtain nice
> >> >> results. This can be verified by checking the last several years of
> >> >> the TREC competition.
> >> >>
> >> >> Honestly, to say that it is a myth that BM25 improves on VSM
> >> >> contradicts the last 10 or 15 years of research on Information
> >> >> Retrieval, and I really believe that is not accurate.
> >> >>
> >> >> The good thing about Information Retrieval is that you can always run
> >> >> your own experiments and draw on the experience of many years of
> >> >> research.
> >> >>
> >> >> P.S.: This opinion is based on experiments on TREC and CLEF
> >> >> collections. Obviously, we could start a debate about the suitability
> >> >> of this type of experimentation (the concept of relevance, pooling,
> >> >> relevance judgements), but that is a much more complex topic and, I
> >> >> believe, far from what we are dealing with here.
> >> >>
> >> >> P.S.2: Regarding TREC-4, Cornell used pivoted length normalisation and
> >> >> applied pseudo-relevance feedback, which honestly makes the analysis
> >> >> of the results much more difficult. Obviously, their results were part
> >> >> of the pool.
> >> >>
> >> >> Sorry for the huge mail :-))))
> >> >>
> >> >> > Hi Ivan,
> >> >> >
> >> >> > The problem is that, unfortunately, BM25 cannot be implemented by
> >> >> > overriding the Similarity interface. Therefore BM25Similarity only
> >> >> > computes the classic probabilistic IDF (which is relevant only at
> >> >> > search time). If you set BM25Similarity at indexing time, some basic
> >> >> > stats (like document lengths) are not stored correctly in the
> >> >> > segments.
> >> >> >
> >> >> > When you use BM25BooleanQuery, this class sets BM25Similarity for
> >> >> > you automatically, so you don't need to do it explicitly.
> >> >> >
> >> >> > I tried to write this implementation with a focus on not interfering
> >> >> > with the typical use of Lucene (so no changes to DefaultSimilarity).
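> >> >> >
> >> >> > In other words, a minimal sketch of the intended usage (pieced
> >> >> > together from Ivan's snippets below; the field name and parameter
> >> >> > values are his):
> >> >> >
> >> >> >   // Index with Lucene's defaults; do NOT set BM25Similarity here.
> >> >> >   IndexWriter writer = new IndexWriter(dir,
> >> >> >       new StandardAnalyzer(Version.LUCENE_CURRENT), true,
> >> >> >       IndexWriter.MaxFieldLength.UNLIMITED);
> >> >> >   // ... add documents, close the writer ...
> >> >> >
> >> >> >   // Search: BM25BooleanQuery sets BM25Similarity internally.
> >> >> >   BM25Parameters.setAverageLength("TEXT", 798.30f);
> >> >> >   BM25Parameters.setB(0.5f);
> >> >> >   BM25Parameters.setK1(2f);
> >> >> >   Searcher searcher = new IndexSearcher(dir, true); // default similarity
> >> >> >   Query query = new BM25BooleanQuery(queryText, "TEXT",
> >> >> >       new StandardAnalyzer(Version.LUCENE_CURRENT));
> >> >> >   TopDocs hits = searcher.search(query, 10);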
> >> >> >
> >> >> >> Joaquin, Robert,
> >> >> >>
> >> >> >> I followed Joaquin's recommendation and removed the calls that set
> >> >> >> the similarity to BM25 explicitly (indexer, searcher). The results
> >> >> >> showed a 55% relative improvement in the MAP score (0.141 -> 0.219)
> >> >> >> over the default similarity.
> >> >> >>
> >> >> >> Joaquin, how would setting the similarity to BM25 explicitly make
> >> >> >> the score worse?
> >> >> >>
> >> >> >> Thank you,
> >> >> >>
> >> >> >> Ivan
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --- On Tue, 2/16/10, Robert Muir <rcmuir@gmail.com> wrote:
> >> >> >>
> >> >> >>> From: Robert Muir <rcmuir@gmail.com>
> >> >> >>> Subject: Re: BM25 Scoring Patch
> >> >> >>> To: java-user@lucene.apache.org
> >> >> >>> Date: Tuesday, February 16, 2010, 11:36 AM
> >> >> >>> Yes Ivan, if possible please report back any findings you can on
> >> >> >>> the experiments you are doing!
> >> >> >>>
> >> >> >>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> >> >> >>> <joaquin.perez@lsi.uned.es> wrote:
> >> >> >>>
> >> >> >>> > Hi Ivan,
> >> >> >>> >
> >> >> >>> > You shouldn't set BM25Similarity for indexing or searching.
> >> >> >>> > Please try removing the lines:
> >> >> >>> >   writer.setSimilarity(new BM25Similarity());
> >> >> >>> >   searcher.setSimilarity(sim);
> >> >> >>> >
> >> >> >>> > Please let us/me know if these changes improve your results.
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > Robert Muir wrote:
> >> >> >>> >
> >> >> >>> >> Hi Ivan, I've seen many cases where BM25 performs worse than
> >> >> >>> >> Lucene's default Similarity. Perhaps this is just another one?
> >> >> >>> >>
> >> >> >>> >> Again, while I have not worked with this particular collection,
> >> >> >>> >> I looked at the statistics and noted that it's composed of
> >> >> >>> >> several 'sub-collections': for example, the PAT documents on
> >> >> >>> >> disk 3 have an average doc length of 3543, but the AP documents
> >> >> >>> >> on disk 1 have an average doc length of 353.
> >> >> >>> >>
> >> >> >>> >> I have found on other collections that any advantages of BM25's
> >> >> >>> >> document length normalization fall apart when 'average document
> >> >> >>> >> length' doesn't make a whole lot of sense (cases like this).
> >> >> >>> >>
> >> >> >>> >> For this same reason, I've only found a few collections where
> >> >> >>> >> BM25's doc length normalization is really significantly better
> >> >> >>> >> than Lucene's.
> >> >> >>> >>
> >> >> >>> >> In my opinion, the results on a particular test collection or
> >> >> >>> >> two have perhaps been taken too far and created a myth that BM25
> >> >> >>> >> is always superior to Lucene's scoring... this is not true!
> >> >> >>> >>
> >> >> >>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
> >> >> >>> >> <iprovalo@yahoo.com> wrote:
> >> >> >>> >>
> >> >> >>> >>> I applied the Lucene patch mentioned in
> >> >> >>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and ran the
> >> >> >>> >>> MAP numbers on the TREC-3 collection using topics 151-200. I am
> >> >> >>> >>> now getting worse results compared to Lucene's
> >> >> >>> >>> DefaultSimilarity. I suspect I am not using it correctly. I
> >> >> >>> >>> have single-field documents. This is the process I use:
> >> >> >>> >>>
> >> >> >>> >>> 1. During indexing, I am setting the similarity to BM25 as such:
> >> >> >>> >>>
> >> >> >>> >>> IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
> >> >> >>> >>>     Version.LUCENE_CURRENT), true,
> >> >> >>> >>>     IndexWriter.MaxFieldLength.UNLIMITED);
> >> >> >>> >>> writer.setSimilarity(new BM25Similarity());
> >> >> >>> >>>
> >> >> >>> >>> 2. During the precision/recall measurements, I am using a
> >> >> >>> >>> SimpleBM25QQParser extension I added to the benchmark:
> >> >> >>> >>>
> >> >> >>> >>> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
> >> >> >>> >>>
> >> >> >>> >>> 3. Here is the parser code (I set an avg doc length here):
> >> >> >>> >>>
> >> >> >>> >>> public Query parse(QualityQuery qq) throws ParseException {
> >> >> >>> >>>   BM25Parameters.setAverageLength(indexField, 798.30f); // avg doc length
> >> >> >>> >>>   BM25Parameters.setB(0.5f);  // tried default values
> >> >> >>> >>>   BM25Parameters.setK1(2f);
> >> >> >>> >>>   return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
> >> >> >>> >>>       new StandardAnalyzer(Version.LUCENE_CURRENT));
> >> >> >>> >>> }
> >> >> >>> >>>
> >> >> >>> >>> 4. The searcher is using the BM25 similarity:
> >> >> >>> >>>
> >> >> >>> >>> Searcher searcher = new IndexSearcher(dir, true);
> >> >> >>> >>> searcher.setSimilarity(sim);
> >> >> >>> >>>
> >> >> >>> >>> Am I missing some steps? Does anyone have experience with this
> >> >> >>> >>> code?
> >> >> >>> >>>
> >> >> >>> >>> Thanks,
> >> >> >>> >>>
> >> >> >>> >>> Ivan
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> > --
> >> >> >>> > -----------------------------------------------------------
> >> >> >>> > Joaquín Pérez Iglesias
> >> >> >>> > Dpto. Lenguajes y Sistemas Informáticos
> >> >> >>> > E.T.S.I. Informática (UNED)
> >> >> >>> > Ciudad Universitaria
> >> >> >>> > C/ Juan del Rosal nº 16
> >> >> >>> > 28040 Madrid - Spain
> >> >> >>> > Phone: +34 91 398 89 19
> >> >> >>> > Fax:   +34 91 398 65 35
> >> >> >>> > Office: 2.11
> >> >> >>> > Email: joaquin.perez@lsi.uned.es
> >> >> >>> > web: http://nlp.uned.es/~jperezi/
> >> >> >>> > -----------------------------------------------------------
> >> >> >>> >
> >> >> >>> --
> >> >> >>> Robert Muir
> >> >> >>> rcmuir@gmail.com
> >> >> >>>
> >> >> >>
> >> >> >
> >> >>
> >> >
> >> >
> >> > --
> >> > Robert Muir
> >> > rcmuir@gmail.com
> >> >
> >>
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>


--
Robert Muir
rcmuir@gmail.com