lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: BM25 Scoring Patch
Date Tue, 16 Feb 2010 20:37:43 GMT
Joaquin, I have a typical methodology where I don't optimize any scoring
params: be it BM25 params (I stick with your defaults) or lnb.ltc params (I
stick with the default slope). When doing query expansion I don't modify the
defaults for MoreLikeThis either.

I've found that changing these params can make a significant difference in
retrieval performance, which is interesting, but I'm typically focused on
text analysis (how is the text indexed? stemming? stopwords?). I also feel
that such things are corpus-specific, which I generally try to avoid in my
work...

For example, in analysis work, the text collection often has a majority of
text in a specific tense (i.e. news), so I don't at all try to tune any part
of analysis, as I worry this would be corpus-specific... I do the same with
scoring.

As far as why some models perform better than others for certain languages,
I think this is a million-dollar question. But my intuition (I don't have
references or anything to back this up) is that probabilistic models
outperform vector-space models when you are using approaches like n-grams:
you don't have nice stopword lists, stemming, decompounding, etc.

This is particularly interesting to me, as probabilistic model + n-gram is a
very general multilingual approach that I would like to have working well in
Lucene; it's also important as a "default" when we don't have a nicely tuned
analyzer available that will work well with a vector-space model. In my
opinion, vector-space tends to fall apart without good language support.
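
Since much of this thread turns on how sensitive BM25 is to its k1 and b
parameters, here is a rough, self-contained sketch of the BM25 term weight
(classic probabilistic IDF plus k1/b saturation and length normalization).
This is purely illustrative, not the patch's code; the class and method
names are invented for the example.

```java
// Illustrative sketch of BM25 term weighting. NOT the patch's actual code;
// the class and method names here are made up for the example.
public class BM25Sketch {

    // Classic probabilistic IDF: rarer terms score higher.
    static double idf(int docCount, int docFreq) {
        return Math.log((docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Score contribution of one term in one document.
    // k1 controls term-frequency saturation; b controls how strongly the
    // document length is normalized against the collection's average length.
    static double termScore(double tf, double docLen, double avgDocLen,
                            double k1, double b, int docCount, int docFreq) {
        double norm = k1 * ((1 - b) + b * (docLen / avgDocLen));
        return idf(docCount, docFreq) * (tf * (k1 + 1)) / (tf + norm);
    }

    public static void main(String[] args) {
        // With b = 0.75, a longer-than-average document is penalized:
        double shortDoc = termScore(3, 100, 300, 2.0, 0.75, 1000, 50);
        double longDoc  = termScore(3, 900, 300, 2.0, 0.75, 1000, 50);
        System.out.println(shortDoc > longDoc); // prints: true

        // With b = 0, document length has no effect at all:
        System.out.println(termScore(3, 100, 300, 2.0, 0.0, 1000, 50)
                        == termScore(3, 900, 300, 2.0, 0.0, 1000, 50)); // prints: true
    }
}
```

The b parameter is exactly what makes "average document length" matter:
with b = 0 length normalization is disabled entirely, which is one way to
sanity-check a collection where the average length is not meaningful.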


On Tue, Feb 16, 2010 at 3:23 PM, JOAQUIN PEREZ IGLESIAS <
joaquin.perez@lsi.uned.es> wrote:

> Ok,
>
> I'm not advocating the BM25 patch either; unfortunately BM25 was not my
> idea :-))), and I'm sure that the implementation can be improved.
>
> When you use the BM25 implementation, are you optimising the parameters
> specifically per collection? (It is a key factor for improving BM25
> performance).
>
> Why do you think that BM25 works better for English than for other
> languages (apart from experiments)? What are your intuitions?
>
> I don't have much experience with languages other than Spanish and
> English, so this sounds pretty interesting.
>
> Kind Regards.
>
> P.S.: Maybe this is not a topic for this list?
>
>
> > Joaquin, I don't see this as a flame war? First of all I'd like to
> > personally thank you for your excellent BM25 implementation!
> >
> > I think the selection of a retrieval model depends highly on the
> > language/indexing approach; i.e., if we were talking East Asian languages
> > I think we want a probabilistic model: no argument there!
> >
> > All I said was that it is a myth that BM25 is "always" better than
> > Lucene's scoring model; it really depends on what you are trying to do,
> > how you are indexing your text, the properties of your corpus, and how
> > your queries are run.
> >
> > I don't even want to come across as advocating the lnb.ltc approach
> > either; sure, I wrote the patch, but that means nothing. I only like it
> > because it's currently a simple integration into Lucene, but long-term
> > it's best if we can support other models also!
> >
> > Finally, I think there is something to be said for Lucene's default
> > retrieval model, which in my (non-English) findings across the board
> > isn't terrible at all... then again I am working with languages where
> > analysis is really the thing holding Lucene back, not scoring.
> >
> > On Tue, Feb 16, 2010 at 2:40 PM, JOAQUIN PEREZ IGLESIAS <
> > joaquin.perez@lsi.uned.es> wrote:
> >
> >> Just some final comments (as I said, I'm not interested in flame wars):
> >>
> >> If I obtain better results, there is no problem with pooling; otherwise
> >> it is biased.
> >> The only important thing (in my opinion) is that it cannot be said that
> >> BM25 is a myth.
> >> Yes, you are right that there is no single ranking model that beats the
> >> rest, but there are models that generally show better performance in
> >> more cases.
> >>
> >> About CLEF, I have had the same experience (VSM vs BM25) on Spanish and
> >> English (WebCLEF) and Q&A (ResPubliQA).
> >>
> >> Ivan, check the parameters (b and k1); you can probably improve your
> >> results. (That's the bad part of BM25.)
> >>
> >> Finally, we are just speaking from personal experience, so obviously you
> >> should use the best model for your data and your own experience; in IR
> >> there are neither myths nor best ranking models. If any of us were able
> >> to find the “best” ranking model, or to prove that any state-of-the-art
> >> model is a myth, he should send those results to the SIGIR conference.
> >>
> >> Ivan, Robert, good luck with your experiments; as I said, the good part
> >> of IR is that you can always run experiments on your own.
> >>
> >> > I don't think it's really a competition; preferably we should have
> >> > the flexibility to change the scoring model in Lucene, actually.
> >> >
> >> > I have found lots of cases where VSM improves on BM25, but then again
> >> > I don't work with TREC stuff, as I work with non-English collections.
> >> >
> >> > It doesn't contradict years of research to say that VSM is a
> >> > state-of-the-art model: besides the TREC-4 results, there are CLEF
> >> > results where VSM models perform competitively with or exceed
> >> > (Finnish, Russian, etc.) BM25/DFR/etc.
> >> >
> >> > It depends on the collection, there isn't a 'best retrieval formula'.
> >> >
> >> > Note: I have no bias against BM25, but it's definitely a myth to say
> >> > there is a single retrieval formula that is the 'best' across the
> >> > board.
> >> >
> >> >
> >> > On Tue, Feb 16, 2010 at 1:53 PM, JOAQUIN PEREZ IGLESIAS <
> >> > joaquin.perez@lsi.uned.es> wrote:
> >> >
> >> >> By the way,
> >> >>
> >> >> I don't want to start a flame war of VSM vs BM25, but I really
> >> >> believe I have to express my opinion, as Robert has done. In my
> >> >> experience, I have never found a case where VSM significantly
> >> >> improves on BM25. Maybe you can find some cases under very specific
> >> >> collection characteristics (such as an average length of 300 vs
> >> >> 3000), or with improper usage of BM25 (wrong parameters), where it
> >> >> can happen.
> >> >>
> >> >> BM25 is not just a different way of doing length normalization; it
> >> >> is strongly based on the probabilistic framework, and it parametrises
> >> >> frequencies and length. It is probably the most successful ranking
> >> >> model of recent years in Information Retrieval.
> >> >>
> >> >> I have never read a paper where VSM improves on any of the
> >> >> state-of-the-art ranking models (Language Models, DFR, BM25, ...),
> >> >> although VSM with pivoted length normalisation can obtain nice
> >> >> results. This can be verified by checking the last years of the TREC
> >> >> competition.
> >> >>
> >> >> Honestly, to say that it is a myth that BM25 improves on VSM breaks
> >> >> the last 10 or 15 years of research on Information Retrieval, and I
> >> >> really believe that is not accurate.
> >> >>
> >> >> The good thing about Information Retrieval is that you can always
> >> >> run your own experiments and draw on the experience of many years of
> >> >> research.
> >> >>
> >> >> PS: This opinion is based on experiments on TREC and CLEF
> >> >> collections; obviously we could start a debate about the suitability
> >> >> of this type of experimentation (concept of relevance, pooling,
> >> >> relevance judgements), but that is a much more complex topic and I
> >> >> believe it is far from what we are dealing with here.
> >> >>
> >> >> PS2: In relation to TREC-4, Cornell used pivoted length
> >> >> normalisation and applied pseudo-relevance feedback, which honestly
> >> >> makes the analysis of the results much more difficult. Obviously
> >> >> their results were part of the pool.
> >> >>
> >> >> Sorry for the huge mail :-))))
> >> >>
> >> >> > Hi Ivan,
> >> >> >
> >> >> > The problem is that unfortunately BM25 cannot be implemented by
> >> >> > overriding the Similarity interface. Therefore BM25Similarity only
> >> >> > computes the classic probabilistic IDF (which is of interest only
> >> >> > at search time). If you set BM25Similarity at indexing time, some
> >> >> > basic stats (like document lengths) are not stored correctly in
> >> >> > the segments.
> >> >> >
> >> >> > When you use BM25BooleanQuery, this class will automatically set
> >> >> > BM25Similarity for you, so you don't need to do this explicitly.
> >> >> >
> >> >> > I tried to make this implementation with a focus on not
> >> >> > interfering with the typical use of Lucene (so no changes to
> >> >> > DefaultSimilarity).
> >> >> >
> >> >> >> Joaquin, Robert,
> >> >> >>
> >> >> >> I followed Joaquin's recommendation and removed the calls that
> >> >> >> set the similarity to BM25 explicitly (indexer, searcher). The
> >> >> >> results showed a 55% improvement in the MAP score (0.141 -> 0.219)
> >> >> >> over default similarity.
> >> >> >>
> >> >> >> Joaquin, how would setting the similarity to BM25 explicitly make
> >> >> >> the score worse?
> >> >> >>
> >> >> >> Thank you,
> >> >> >>
> >> >> >> Ivan
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --- On Tue, 2/16/10, Robert Muir <rcmuir@gmail.com> wrote:
> >> >> >>
> >> >> >>> From: Robert Muir <rcmuir@gmail.com>
> >> >> >>> Subject: Re: BM25 Scoring Patch
> >> >> >>> To: java-user@lucene.apache.org
> >> >> >>> Date: Tuesday, February 16, 2010, 11:36 AM
> >> >> >>> Yes Ivan, if possible please report back any findings you can
> >> >> >>> on the experiments you are doing!
> >> >> >>>
> >> >> >>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> >> >> >>> <joaquin.perez@lsi.uned.es> wrote:
> >> >> >>>
> >> >> >>> > Hi Ivan,
> >> >> >>> >
> >> >> >>> > You shouldn't set the BM25Similarity for indexing or
> >> >> >>> > searching. Please try removing the lines:
> >> >> >>> >   writer.setSimilarity(new BM25Similarity());
> >> >> >>> >   searcher.setSimilarity(sim);
> >> >> >>> >
> >> >> >>> > Please let us/me know if you improve your results with these
> >> >> >>> > changes.
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > Robert Muir wrote:
> >> >> >>> >
> >> >> >>> >> Hi Ivan, I've seen many cases where BM25 performs worse than
> >> >> >>> >> Lucene's default Similarity. Perhaps this is just another one?
> >> >> >>> >>
> >> >> >>> >> Again, while I have not worked with this particular
> >> >> >>> >> collection, I looked at the statistics and noted that it's
> >> >> >>> >> composed of several 'sub-collections': for example the PAT
> >> >> >>> >> documents on disk 3 have an average doc length of 3543, but
> >> >> >>> >> the AP documents on disk 1 have an avg doc length of 353.
> >> >> >>> >>
> >> >> >>> >> I have found on other collections that any advantages of
> >> >> >>> >> BM25's document length normalization fall apart when 'average
> >> >> >>> >> document length' doesn't make a whole lot of sense (cases
> >> >> >>> >> like this).
> >> >> >>> >>
> >> >> >>> >> For this same reason, I've only found a few collections
> >> >> >>> >> where BM25's doc length normalization is really significantly
> >> >> >>> >> better than Lucene's.
> >> >> >>> >>
> >> >> >>> >> In my opinion, the results on a particular test collection
> >> >> >>> >> or two have perhaps been taken too far and created a myth
> >> >> >>> >> that BM25 is always superior to Lucene's scoring... this is
> >> >> >>> >> not true!
> >> >> >>> >>
> >> >> >>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
> >> >> >>> >> <iprovalo@yahoo.com> wrote:
> >> >> >>> >>
> >> >> >>> >>> I applied the Lucene patch mentioned in
> >> >> >>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and ran
> >> >> >>> >>> the MAP numbers on the TREC-3 collection using topics
> >> >> >>> >>> 151-200. I am now getting worse results compared to Lucene
> >> >> >>> >>> DefaultSimilarity. I suspect I am not using it correctly. I
> >> >> >>> >>> have single-field documents. This is the process I use:
> >> >> >>> >>>
> >> >> >>> >>> 1. During the indexing, I am setting the similarity to BM25
> >> >> >>> >>> as such:
> >> >> >>> >>>
> >> >> >>> >>> IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
> >> >> >>> >>>     Version.LUCENE_CURRENT), true,
> >> >> >>> >>>     IndexWriter.MaxFieldLength.UNLIMITED);
> >> >> >>> >>> writer.setSimilarity(new BM25Similarity());
> >> >> >>> >>>
> >> >> >>> >>> 2. During the Precision/Recall measurements, I am using a
> >> >> >>> >>> SimpleBM25QQParser extension I added to the benchmark:
> >> >> >>> >>>
> >> >> >>> >>> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>> 3. Here is the parser code (I set an avg doc length here):
> >> >> >>> >>>
> >> >> >>> >>> public Query parse(QualityQuery qq) throws ParseException {
> >> >> >>> >>>   BM25Parameters.setAverageLength(indexField, 798.30f); // avg doc length
> >> >> >>> >>>   BM25Parameters.setB(0.5f); // tried default values
> >> >> >>> >>>   BM25Parameters.setK1(2f);
> >> >> >>> >>>   return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
> >> >> >>> >>>       new StandardAnalyzer(Version.LUCENE_CURRENT));
> >> >> >>> >>> }
> >> >> >>> >>>
> >> >> >>> >>> 4. The searcher is using BM25 similarity:
> >> >> >>> >>>
> >> >> >>> >>> Searcher searcher = new IndexSearcher(dir, true);
> >> >> >>> >>> searcher.setSimilarity(sim);
> >> >> >>> >>>
> >> >> >>> >>> Am I missing some steps? Does anyone have experience with
> >> >> >>> >>> this code?
> >> >> >>> >>>
> >> >> >>> >>> Thanks,
> >> >> >>> >>>
> >> >> >>> >>> Ivan
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>> ---------------------------------------------------------------------
> >> >> >>> >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >> >>> >>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> > --
> >> >> >>> >
> >> >> >>> -----------------------------------------------------------
> >> >> >>> > Joaquín Pérez Iglesias
> >> >> >>> > Dpto. Lenguajes y Sistemas Informáticos
> >> >> >>> > E.T.S.I. Informática (UNED)
> >> >> >>> > Ciudad Universitaria
> >> >> >>> > C/ Juan del Rosal nº 16
> >> >> >>> > 28040 Madrid - Spain
> >> >> >>> > Phone. +34 91 398 89 19
> >> >> >>> > Fax    +34 91 398 65 35
> >> >> >>> > Office  2.11
> >> >> >>> > Email: joaquin.perez@lsi.uned.es
> >> >> >>> > web: http://nlp.uned.es/~jperezi/
> >> >> >>> >
> >> >> >>> -----------------------------------------------------------
> >> >> >>> >
> >> >> >>> >
> >> >> >>> >
> >> >> >>>
> >> >> >>> >
> >> >> >>> >
> >> >> >>>
> >> >> >>>
> >> >> >>> --
> >> >> >>> Robert Muir
> >> >> >>> rcmuir@gmail.com
> >> >> >>>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >
> >> >
> >> > --
> >> > Robert Muir
> >> > rcmuir@gmail.com
> >> >
> >>
> >>
> >>
> >>
> >>
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com
