lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: BM25 Scoring Patch
Date Tue, 16 Feb 2010 18:14:23 GMT
Ivan, just a little more food for thought to help you with this:

I'm glad you got improved results, yet I stand by my original advice to
'be careful' about interpreting too much from one collection.

E.g., had you chosen TREC-4 instead of TREC-3, you would have seen different
results, as vector-space with non-cosine doc length norm (LUCENE-2187)
performed better than BM25 there:
http://trec.nist.gov/pubs/trec4/overview.ps.gz
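
As a rough illustration of why the doc length norm matters here, below is a
minimal sketch of the textbook BM25 term-frequency component (not the code
from the LUCENE-2091 patch). It plugs in the k1, b, and average length from
Ivan's parser code and the two sub-collection average lengths quoted further
down the thread; the class and method names are made up for illustration.

```java
public class Bm25LengthNormSketch {

    // Textbook BM25 term-frequency component (an assumption, not the patch code):
    //   tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen))
    public static double tfComponent(double tf, double docLen, double avgDocLen,
                                     double k1, double b) {
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return tf * (k1 + 1.0) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double k1 = 2.0;            // from Ivan's parser code
        double b = 0.5;             // from Ivan's parser code
        double avgDocLen = 798.30;  // collection-wide average from Ivan's parser code

        // Average lengths quoted in the thread: AP (disk 1) and PAT (disk 3).
        double[] docLens = {353, 3543};
        for (double docLen : docLens) {
            System.out.printf("docLen=%.0f, tf=3 -> %.3f%n",
                    docLen, tfComponent(3, docLen, avgDocLen, k1, b));
        }
    }
}
```

With the single collection-wide average, the same term count in a typical
PAT-length document ends up with roughly half the weight it gets in a typical
AP-length document, which is the kind of skew to watch for when 'average
document length' mixes very different sub-collections.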

In truth, it's hard to 'reuse' a pooled test collection to compare methods
that were not part of the pool:
http://www.ir.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf

This might help explain why you see such a difference in MAP score!
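
To sketch the pooling issue: TREC-style evaluation counts any document without
a relevance judgment as non-relevant, so a method that surfaces documents
outside the original pool can be penalized even when those documents are good.
The toy example below computes average precision for one topic under that
assumption; the document IDs, class name, and method name are hypothetical,
and this is a simplification rather than what trec_eval or the Lucene
benchmark code does internally.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PooledApSketch {

    // Average precision for a single topic. Anything not in `relevant`
    // (judged non-relevant OR never judged) is counted as a miss.
    public static double averagePrecision(List<String> ranking, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranking.size(); i++) {
            if (relevant.contains(ranking.get(i))) {
                hits++;
                sum += (double) hits / (i + 1); // precision at this rank
            }
        }
        return sum / relevant.size();
    }

    public static void main(String[] args) {
        // Judged-relevant documents for the topic (toy data).
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d2"));

        // A run that only returns documents that were in the pool.
        List<String> pooledRun = Arrays.asList("d1", "d2", "d3");

        // A run that ranks an unjudged document "u1" first. Even if "u1" is
        // actually relevant, it has no judgment and so counts against the run.
        List<String> unpooledRun = Arrays.asList("u1", "d1", "d2");

        System.out.println("pooled run AP:   " + averagePrecision(pooledRun, relevant));
        System.out.println("unpooled run AP: " + averagePrecision(unpooledRun, relevant));
    }
}
```

MAP is just the mean of this value over all topics, so a systematic bias of
this kind on unjudged documents carries straight through to the MAP number.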

On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov <iprovalo@yahoo.com> wrote:

> Joaquin, Robert,
>
> I followed Joaquin's recommendation and removed the call to set similarity
> to BM25 explicitly (indexer, searcher).  The results showed 55% improvement
> for the MAP score (0.141->0.219) over default similarity.
>
> Joaquin, how would setting the similarity to BM25 explicitly make the score
> worse?
>
> Thank you,
>
> Ivan
>
>
>
> --- On Tue, 2/16/10, Robert Muir <rcmuir@gmail.com> wrote:
>
> > From: Robert Muir <rcmuir@gmail.com>
> > Subject: Re: BM25 Scoring Patch
> > To: java-user@lucene.apache.org
> > Date: Tuesday, February 16, 2010, 11:36 AM
> > yes Ivan, if possible please report back any findings you can on the
> > experiments you are doing!
> >
> > On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> > <joaquin.perez@lsi.uned.es> wrote:
> >
> > > Hi Ivan,
> > >
> > > You shouldn't set the BM25Similarity for indexing or searching.
> > > Please try removing the lines:
> > >   writer.setSimilarity(new BM25Similarity());
> > >   searcher.setSimilarity(sim);
> > >
> > > Please let us/me know if you improve your results with these changes.
> > >
> > >
> > > Robert Muir wrote:
> > >
> > > >> Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's
> > > >> default Similarity. Perhaps this is just another one?
> > > >>
> > > >> Again, while I have not worked with this particular collection, I looked
> > > >> at the statistics and noted that it's composed of several
> > > >> 'sub-collections': for example, the PAT documents on disk 3 have an
> > > >> average doc length of 3543, but the AP documents on disk 1 have an avg
> > > >> doc length of 353.
> > > >>
> > > >> I have found on other collections that any advantages of BM25's document
> > > >> length normalization fall apart when 'average document length' doesn't
> > > >> make a whole lot of sense (cases like this).
> > > >>
> > > >> For this same reason, I've only found a few collections where BM25's doc
> > > >> length normalization is really significantly better than Lucene's.
> > > >>
> > > >> In my opinion, the results on a particular test collection or 2 have
> > > >> perhaps been taken too far and created a myth that BM25 is always
> > > >> superior to Lucene's scoring... this is not true!
> > >>
> > > >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov <iprovalo@yahoo.com>
> > > >> wrote:
> > >>
> > > >>> I applied the Lucene patch mentioned in
> > > >>> https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP
> > > >>> numbers on the TREC-3 collection using topics 151-200.  I am now
> > > >>> getting worse results compared to Lucene's DefaultSimilarity.  I
> > > >>> suspect I am not using it correctly.  I have single-field
> > > >>> documents.  This is the process I use:
> > >>>
> > > >>> 1. During the indexing, I am setting the similarity to BM25 as such:
> > > >>>
> > > >>> IndexWriter writer = new IndexWriter(dir,
> > > >>>     new StandardAnalyzer(Version.LUCENE_CURRENT), true,
> > > >>>     IndexWriter.MaxFieldLength.UNLIMITED);
> > > >>> writer.setSimilarity(new BM25Similarity());
> > >>>
> > > >>> 2. During the Precision/Recall measurements, I am using a
> > > >>> SimpleBM25QQParser extension I added to the benchmark:
> > > >>>
> > > >>> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
> > >>>
> > >>>
> > > >>> 3. Here is the parser code (I set an avg doc length here):
> > > >>>
> > > >>> public Query parse(QualityQuery qq) throws ParseException {
> > > >>>   BM25Parameters.setAverageLength(indexField, 798.30f); // avg doc length
> > > >>>   BM25Parameters.setB(0.5f); // tried default values
> > > >>>   BM25Parameters.setK1(2f);
> > > >>>   return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
> > > >>>       new StandardAnalyzer(Version.LUCENE_CURRENT));
> > > >>> }
> > >>>
> > > >>> 4. The searcher is using BM25 similarity:
> > > >>>
> > > >>> Searcher searcher = new IndexSearcher(dir, true);
> > > >>> searcher.setSimilarity(sim);
> > > >>>
> > > >>> Am I missing some steps?  Does anyone have experience with this code?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Ivan
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > ---------------------------------------------------------------------
> > >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > >>> For additional commands, e-mail: java-user-help@lucene.apache.org
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > > --
> > >
> > -----------------------------------------------------------
> > > Joaquín Pérez Iglesias
> > > Dpto. Lenguajes y Sistemas Informáticos
> > > E.T.S.I. Informática (UNED)
> > > Ciudad Universitaria
> > > C/ Juan del Rosal nº 16
> > > 28040 Madrid - Spain
> > > Phone. +34 91 398 89 19
> > > Fax    +34 91 398 65 35
> > > Office  2.11
> > > Email: joaquin.perez@lsi.uned.es
> > > web:   http://nlp.uned.es/~jperezi/
> > >
> > -----------------------------------------------------------
> > >
> > >
> > >
> > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
>
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com
