lucene-java-user mailing list archives

From "JOAQUIN PEREZ IGLESIAS" <joaquin.pe...@lsi.uned.es>
Subject Re: BM25 Scoring Patch
Date Tue, 16 Feb 2010 18:53:56 GMT
By the way,

I don't want to start a VSM vs. BM25 flame war, but I feel I have to
express my opinion, as Robert has done. In my experience, I have never
found a case where VSM significantly outperforms BM25. Maybe you can find
such cases under some very specific collection characteristics (such as
average lengths of 300 vs. 3000) or with a bad usage of BM25 (improper
parameters).

BM25 is not just a different way of doing length normalization. It is
firmly grounded in the probabilistic framework, and it parametrises both
term frequencies and document length. It is probably the most successful
ranking model of recent years in Information Retrieval.
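
To make the point about frequencies and length concrete, here is a minimal
sketch of the textbook BM25 weight for a single query term. It is not the
code from the LUCENE-2091 patch: the class and method names (Bm25Sketch,
idf, tfComponent, termScore) are made up for illustration, and the
collection size and document frequency in main are arbitrary example
values. k1 = 2, b = 0.5 and the average length of 798.3 are the settings
Ivan quotes further down in this thread.

```java
/**
 * Minimal sketch of the textbook BM25 weight for one query term.
 * Not the LUCENE-2091 patch code: all names here are hypothetical.
 */
final class Bm25Sketch {

    private Bm25Sketch() {}

    /** Classic probabilistic IDF; negative when docFreq > docCount / 2. */
    static double idf(long docCount, long docFreq) {
        return Math.log((docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Term-frequency component. k1 controls how quickly repeated
     * occurrences saturate; b controls how strongly the weight is
     * normalised by docLen / avgDocLen (b = 0 disables normalisation).
     */
    static double tfComponent(double tf, double docLen, double avgDocLen,
                              double k1, double b) {
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return tf * (k1 + 1.0) / (tf + k1 * lengthNorm);
    }

    /** BM25 contribution of one term to the score of one document. */
    static double termScore(double tf, double docLen, double avgDocLen,
                            long docCount, long docFreq,
                            double k1, double b) {
        return idf(docCount, docFreq)
                * tfComponent(tf, docLen, avgDocLen, k1, b);
    }

    public static void main(String[] args) {
        double k1 = 2.0, b = 0.5, avgDocLen = 798.3;
        // Same term frequency in a short and in a long document;
        // 100000 documents and docFreq 1000 are illustrative values.
        System.out.println(
                termScore(3, 300, avgDocLen, 100_000, 1_000, k1, b));
        System.out.println(
                termScore(3, 3000, avgDocLen, 100_000, 1_000, k1, b));
    }
}
```

With the same term frequency, the longer document receives the lower
weight, and the weight depends on the single value avgDocLen. That is
where a collection mixing documents of length around 300 and around 3000
could matter.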

I have never read a paper where VSM outperforms any of the
state-of-the-art ranking models (Language Models, DFR, BM25, ...),
although VSM with pivoted length normalisation can obtain good results.
This can be verified by checking the last several years of the TREC
competition.

Honestly, to say that it is a myth that BM25 improves on VSM contradicts
the last 10 or 15 years of research in Information Retrieval, and I
really believe that it is not accurate.

The good thing about Information Retrieval is that you can always run
your own experiments, and you can draw on the experience of many years
of research.

PS: This opinion is based on experiments on TREC and CLEF collections.
Obviously we could start a debate about the suitability of this type of
experimentation (concept of relevance, pooling, relevance judgements),
but this is a much more complex topic and I believe it is far from what
we are dealing with here.

PS2: Regarding TREC-4, Cornell used pivoted length normalisation and
they were also applying pseudo-relevance feedback, which honestly makes
the analysis of the results much more difficult. Obviously their results
were part of the pool.
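
For anyone not familiar with pivoted length normalisation, here is a rough
sketch. It assumes the common formulation in which the pivot is the
average document length, so the normaliser becomes
(1 - slope) + slope * (docLen / avgDocLen), combined with a damped term
frequency 1 + ln(1 + ln(tf)). The names (PivotedNormSketch, pivotedNorm,
weight) and the slope of 0.2 are illustrative assumptions. This is not
Cornell's TREC-4 system, and it leaves out pseudo-relevance feedback
entirely.

```java
/**
 * Rough sketch of pivoted document length normalisation, assuming the
 * pivot is the average document length. All names are hypothetical.
 */
final class PivotedNormSketch {

    private PivotedNormSketch() {}

    /**
     * Normaliser: equals 1 for a document of average length, grows for
     * longer documents. slope = 0 switches length normalisation off.
     */
    static double pivotedNorm(double docLen, double avgDocLen,
                              double slope) {
        return (1.0 - slope) + slope * (docLen / avgDocLen);
    }

    /** Damped term frequency divided by the pivoted normaliser. */
    static double weight(double tf, double docLen, double avgDocLen,
                         double slope) {
        if (tf <= 0) {
            return 0.0; // the term does not occur in the document
        }
        double dampedTf = 1.0 + Math.log(1.0 + Math.log(tf));
        return dampedTf / pivotedNorm(docLen, avgDocLen, slope);
    }

    public static void main(String[] args) {
        double slope = 0.2, avgDocLen = 798.3; // illustrative values
        System.out.println(weight(3, 300, avgDocLen, slope));
        System.out.println(weight(3, 3000, avgDocLen, slope));
    }
}
```

As with BM25, the normaliser depends on the average document length, so
both models are sensitive to how meaningful that average is for the
collection.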

Sorry for the huge mail :-))))

> Hi Ivan,
>
> the problem is that unfortunately BM25 cannot be implemented by
> overriding the Similarity interface. Therefore BM25Similarity only
> computes the classic probabilistic IDF (which is of interest only at
> search time). If you set BM25Similarity at indexing time, some basic
> stats (like document lengths) are not stored correctly in the segments.
>
> When you use BM25BooleanQuery, this class will automatically set the
> BM25Similarity for you, so you don't need to do this explicitly.
>
> I tried to write this implementation with a focus on not interfering
> with the typical use of Lucene (so no changes to DefaultSimilarity).
>
>> Joaquin, Robert,
>>
>> I followed Joaquin's recommendation and removed the calls that set the
>> similarity to BM25 explicitly (indexer, searcher).  The results showed
>> a 55% improvement in the MAP score (0.141->0.219) over the default
>> similarity.
>>
>> Joaquin, how would setting the similarity to BM25 explicitly make the
>> score worse?
>>
>> Thank you,
>>
>> Ivan
>>
>>
>>
>> --- On Tue, 2/16/10, Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> From: Robert Muir <rcmuir@gmail.com>
>>> Subject: Re: BM25 Scoring Patch
>>> To: java-user@lucene.apache.org
>>> Date: Tuesday, February 16, 2010, 11:36 AM
>>> Yes Ivan, if possible please report back any findings you can on the
>>> experiments you are doing!
>>>
>>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
>>> <joaquin.perez@lsi.uned.es> wrote:
>>>
>>> > Hi Ivan,
>>> >
>>> > You shouldn't set the BM25Similarity for indexing or searching.
>>> > Please try removing the lines:
>>> >   writer.setSimilarity(new BM25Similarity());
>>> >   searcher.setSimilarity(sim);
>>> >
>>> > Please let us/me know if you improve your results with these changes.
>>> >
>>> >
>>> > Robert Muir wrote:
>>> >
>>> >> Hi Ivan, I've seen many cases where BM25 performs worse than
>>> >> Lucene's default Similarity. Perhaps this is just another one?
>>> >>
>>> >> Again, while I have not worked with this particular collection, I
>>> >> looked at the statistics and noted that it's composed of several
>>> >> 'sub-collections': for example, the PAT documents on disk 3 have an
>>> >> average doc length of 3543, but the AP documents on disk 1 have an
>>> >> avg doc length of 353.
>>> >>
>>> >> I have found on other collections that any advantages of BM25's
>>> >> document length normalization fall apart when 'average document
>>> >> length' doesn't make a whole lot of sense (cases like this).
>>> >>
>>> >> For this same reason, I've only found a few collections where
>>> >> BM25's doc length normalization is really significantly better than
>>> >> Lucene's.
>>> >>
>>> >> In my opinion, the results on a particular test collection or 2
>>> >> have perhaps been taken too far and created a myth that BM25 is
>>> >> always superior to Lucene's scoring... this is not true!
>>> >>
>>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov <iprovalo@yahoo.com>
>>> >> wrote:
>>> >>
>>> >>> I applied the Lucene patch mentioned in
>>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP
>>> >>> numbers on the TREC-3 collection using topics 151-200.  I am
>>> >>> getting worse results compared to Lucene's DefaultSimilarity.  I
>>> >>> suspect I am not using it correctly.  I have single-field
>>> >>> documents.  This is the process I use:
>>> >>>
>>> >>> 1. During the indexing, I am setting the similarity to BM25 as such:
>>> >>>
>>> >>> IndexWriter writer = new IndexWriter(dir,
>>> >>>     new StandardAnalyzer(Version.LUCENE_CURRENT), true,
>>> >>>     IndexWriter.MaxFieldLength.UNLIMITED);
>>> >>> writer.setSimilarity(new BM25Similarity());
>>> >>>
>>> >>> 2. During the Precision/Recall measurements, I am using a
>>> >>> SimpleBM25QQParser extension I added to the benchmark:
>>> >>>
>>> >>> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
>>> >>>
>>> >>> 3. Here is the parser code (I set an avg doc length here):
>>> >>>
>>> >>> public Query parse(QualityQuery qq) throws ParseException {
>>> >>>   BM25Parameters.setAverageLength(indexField, 798.30f); // avg doc length
>>> >>>   BM25Parameters.setB(0.5f); // tried default values
>>> >>>   BM25Parameters.setK1(2f);
>>> >>>   return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
>>> >>>       new StandardAnalyzer(Version.LUCENE_CURRENT));
>>> >>> }
>>> >>>
>>> >>> 4. The searcher is using BM25 similarity:
>>> >>>
>>> >>> Searcher searcher = new IndexSearcher(dir, true);
>>> >>> searcher.setSimilarity(sim);
>>> >>>
>>> >>> Am I missing some steps?  Does anyone have experience with this code?
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Ivan
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> ---------------------------------------------------------------------
>>> >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> >>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>> >>>
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> > --
>>> >
>>> -----------------------------------------------------------
>>> > Joaquín Pérez Iglesias
>>> > Dpto. Lenguajes y Sistemas Informáticos
>>> > E.T.S.I. Informática (UNED)
>>> > Ciudad Universitaria
>>> > C/ Juan del Rosal nº 16
>>> > 28040 Madrid - Spain
>>> > Phone. +34 91 398 89 19
>>> > Fax    +34 91 398 65 35
>>> > Office  2.11
>>> > Email: joaquin.perez@lsi.uned.es
>>> > web:   http://nlp.uned.es/~jperezi/
>>> >
>>> -----------------------------------------------------------
>>> >
>>> >
>>>
>>>
>>> --
>>> Robert Muir
>>> rcmuir@gmail.com
>>>
>>
>>
>>
>>
>>
>


