lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: BM25 Scoring Patch
Date Tue, 16 Feb 2010 16:36:46 GMT
Yes Ivan, if possible, please report back any findings from the
experiments you are doing!

On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias <joaquin.perez@lsi.uned.es> wrote:

> Hi Ivan,
>
> You shouldn't set the BM25Similarity for indexing or searching.
> Please try removing the lines:
>   writer.setSimilarity(new BM25Similarity());
>   searcher.setSimilarity(sim);
>
> Please let us/me know if you improve your results with these changes.
>
>
> Robert Muir wrote:
>
>> Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's
>> default Similarity. Perhaps this is just another one?
>>
>> Again, while I have not worked with this particular collection, I looked at
>> the statistics and noted that it's composed of several 'sub-collections': for
>> example, the PAT documents on disk 3 have an average doc length of 3543, but
>> the AP documents on disk 1 have an avg doc length of 353.
>>
>> I have found on other collections that any advantages of BM25's document
>> length normalization fall apart when 'average document length' doesn't make
>> a whole lot of sense (cases like this).
>>
>> For this same reason, I've only found a few collections where BM25's doc
>> length normalization is really significantly better than Lucene's.
>>
>> In my opinion, the results on a particular test collection or two have
>> perhaps been taken too far and created a myth that BM25 is always superior
>> to Lucene's scoring... this is not true!
>>
>> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov <iprovalo@yahoo.com> wrote:
>>
>>> I applied the Lucene patch mentioned in
>>> https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP numbers
>>> on the TREC-3 collection using topics 151-200.  I am now getting worse
>>> results compared to Lucene's DefaultSimilarity.  I suspect I am not using
>>> it correctly.  I have single-field documents.  This is the process I use:
>>>
>>> 1. During the indexing, I am setting the similarity to BM25 as such:
>>>
>>> IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
>>>               Version.LUCENE_CURRENT), true,
>>>               IndexWriter.MaxFieldLength.UNLIMITED);
>>> writer.setSimilarity(new BM25Similarity());
>>>
>>> 2. During the Precision/Recall measurements, I am using a
>>> SimpleBM25QQParser extension I added to the benchmark:
>>>
>>> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
>>>
>>>
>>> 3. Here is the parser code (I set an avg doc length here):
>>>
>>> public Query parse(QualityQuery qq) throws ParseException {
>>>   BM25Parameters.setAverageLength(indexField, 798.30f);//avg doc length
>>>   BM25Parameters.setB(0.5f);//tried default values
>>>   BM25Parameters.setK1(2f);
>>>   return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
>>>       new StandardAnalyzer(Version.LUCENE_CURRENT));
>>> }
>>>
>>> 4. The searcher is using BM25 similarity:
>>>
>>> Searcher searcher = new IndexSearcher(dir, true);
>>> searcher.setSimilarity(sim);
>>>
>>> Am I missing some steps?  Does anyone have experience with this code?
>>>
>>> Thanks,
>>>
>>> Ivan
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>
>>
> --
> -----------------------------------------------------------
> Joaquín Pérez Iglesias
> Dpto. Lenguajes y Sistemas Informáticos
> E.T.S.I. Informática (UNED)
> Ciudad Universitaria
> C/ Juan del Rosal nº 16
> 28040 Madrid - Spain
> Phone. +34 91 398 89 19
> Fax    +34 91 398 65 35
> Office  2.11
> Email: joaquin.perez@lsi.uned.es
> web:   http://nlp.uned.es/~jperezi/
> -----------------------------------------------------------
>
>
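
To make the length-normalization point in the quoted discussion concrete,
here is a minimal sketch of the textbook BM25 term-frequency component. It is
not code from the LUCENE-2091 patch: the class and method names are
hypothetical, the term frequency of 3 is an arbitrary illustration, and the
two sub-collection averages (353 and 3543) are used as stand-in document
lengths. The k1, b and average-length values are the ones from Ivan's parser.

```java
/** Illustrative only: textbook BM25 tf component, not the patch's implementation. */
final class Bm25LengthNorm {

    /**
     * Computes tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen)).
     * The IDF factor is left out because it does not depend on document length.
     */
    static double tfComponent(double tf, double docLen, double avgDocLen,
                              double k1, double b) {
        // Greater than 1 for documents longer than average, less than 1 for shorter ones.
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return tf * (k1 + 1.0) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double k1 = 2.0;            // value from the quoted parser code
        double b = 0.5;             // value from the quoted parser code
        double avgDocLen = 798.30;  // collection-wide average from the quoted parser code
        double tf = 3.0;            // hypothetical term frequency

        // Stand-in lengths: AP-like, exactly average, PAT-like.
        double[] docLens = {353.0, 798.30, 3543.0};
        for (double docLen : docLens) {
            System.out.printf("docLen=%.0f tfComponent=%.3f%n",
                    docLen, tfComponent(tf, docLen, avgDocLen, k1, b));
        }
    }
}
```

With a single collection-wide average, a document of AP-like length is always
treated as short and a document of PAT-like length as long, so the same term
frequency is weighted quite differently depending on which sub-collection the
document came from. Whether that helps or hurts ranking is an empirical
question for the collection at hand.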


-- 
Robert Muir
rcmuir@gmail.com
