lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "JOAQUIN PEREZ IGLESIAS" <joaquin.pe...@lsi.uned.es>
Subject Re: BM25 Scoring Patch
Date Tue, 16 Feb 2010 17:58:03 GMT
Hi Ivan,

the problem is that unfortunately BM25
cannot be implemented overwriting
the Similarity interface. Therefore BM25Similarity
only computes the classic probabilistic IDF (what is
interesting only at search time).
If you set BM25Similarity at indexing time
some basic stats are not stored
correctly in the segments (like docs length).

When you use BM25BooleanQuery this class
will set automatically the BM25Similarity for you,
therefore you don't need to do this explicitly.

I tried to make this implementation with the focus on
not interfering on the typical use of Lucene (so no changing
DefaultSimilarity).

> Joaquin, Robert,
>
> I followed Joaquin's recommendation and removed the call to set similarity
> to BM25 explicitly (indexer, searcher).  The results showed 55%
> improvement for the MAP score (0.141->0.219) over default similarity.
>
> Joaquin, how would setting the similarity to BM25 explicitly make the
> score worse?
>
> Thank you,
>
> Ivan
>
>
>
> --- On Tue, 2/16/10, Robert Muir <rcmuir@gmail.com> wrote:
>
>> From: Robert Muir <rcmuir@gmail.com>
>> Subject: Re: BM25 Scoring Patch
>> To: java-user@lucene.apache.org
>> Date: Tuesday, February 16, 2010, 11:36 AM
>> yes Ivan, if possible please report
>> back any findings you can on the
>> experiments you are doing!
>>
>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
>> <
>> joaquin.perez@lsi.uned.es>
>> wrote:
>>
>> > Hi Ivan,
>> >
>> > You shouldn't set the BM25Similarity for indexing or
>> searching.
>> > Please try removing the lines:
>> >   writer.setSimilarity(new
>> BM25Similarity());
>> >   searcher.setSimilarity(sim);
>> >
>> > Please let us/me know if you improve your results with
>> these changes.
>> >
>> >
>> > Robert Muir escribió:
>> >
>> >  Hi Ivan, I've seen many cases where BM25
>> performs worse than Lucene's
>> >> default Similarity. Perhaps this is just another
>> one?
>> >>
>> >> Again while I have not worked with this particular
>> collection, I looked at
>> >> the statistics and noted that its composed of
>> several 'sub-collections':
>> >> for
>> >> example the PAT documents on disk 3 have an
>> average doc length of 3543,
>> >> but
>> >> the AP documents on disk 1 have an avg doc length
>> of 353.
>> >>
>> >> I have found on other collections that any
>> advantages of BM25's document
>> >> length normalization fall apart when 'average
>> document length' doesn't
>> >> make
>> >> a whole lot of sense (cases like this).
>> >>
>> >> For this same reason, I've only found a few
>> collections where BM25's doc
>> >> length normalization is really significantly
>> better than Lucene's.
>> >>
>> >> In my opinion, the results on a particular test
>> collection or 2 have
>> >> perhaps
>> >> been taken too far and created a myth that BM25 is
>> always superior to
>> >> Lucene's scoring... this is not true!
>> >>
>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
>> <iprovalo@yahoo.com>
>> >> wrote:
>> >>
>> >>  I applied the Lucene patch mentioned in
>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
>> ran the MAP
>> >>> numbers
>> >>> on TREC-3 collection using topics
>> 151-200.  I am not getting worse
>> >>> results
>> >>> comparing to Lucene DefaultSimilarity.  I
>> suspect, I am not using it
>> >>> correctly.  I have single field
>> documents.  This is the process I use:
>> >>>
>> >>> 1. During the indexing, I am setting the
>> similarity to BM25 as such:
>> >>>
>> >>> IndexWriter writer = new IndexWriter(dir, new
>> StandardAnalyzer(
>> >>>
>>    Version.LUCENE_CURRENT), true,
>> >>>
>>    IndexWriter.MaxFieldLength.UNLIMITED);
>> >>> writer.setSimilarity(new BM25Similarity());
>> >>>
>> >>> 2. During the Precision/Recall measurements, I
>> am using a
>> >>> SimpleBM25QQParser extension I added to the
>> benchmark:
>> >>>
>> >>> QualityQueryParser qqParser = new
>> SimpleBM25QQParser("title", "TEXT");
>> >>>
>> >>>
>> >>> 3. Here is the parser code (I set an avg doc
>> length here):
>> >>>
>> >>> public Query parse(QualityQuery qq) throws
>> ParseException {
>> >>>   BM25Parameters.setAverageLength(indexField,
>> 798.30f);//avg doc length
>> >>>   BM25Parameters.setB(0.5f);//tried
>> default values
>> >>>   BM25Parameters.setK1(2f);
>> >>>   return query = new
>> BM25BooleanQuery(qq.getValue(qqName), indexField,
>> >>> new
>> >>> StandardAnalyzer(Version.LUCENE_CURRENT));
>> >>> }
>> >>>
>> >>> 4. The searcher is using BM25 similarity:
>> >>>
>> >>> Searcher searcher = new IndexSearcher(dir,
>> true);
>> >>> searcher.setSimilarity(sim);
>> >>>
>> >>> Am I missing some steps?  Does anyone
>> have experience with this code?
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Ivan
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> > --
>> >
>> -----------------------------------------------------------
>> > Joaquín Pérez Iglesias
>> > Dpto. Lenguajes y Sistemas Informáticos
>> > E.T.S.I. Informática (UNED)
>> > Ciudad Universitaria
>> > C/ Juan del Rosal nº 16
>> > 28040 Madrid - Spain
>> > Phone. +34 91 398 89 19
>> > Fax    +34 91 398 65 35
>> > Office  2.11
>> > Email: joaquin.perez@lsi.uned.es
>> > web:   http://nlp.uned.es/~jperezi/ <http://nlp.uned.es/%7Ejperezi/>
>> >
>> -----------------------------------------------------------
>> >
>> >
>> >
>> ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message