lucene-solr-user mailing list archives

From Otis Gospodnetic <otis.gospodne...@gmail.com>
Subject Re: BM25 model for solr 4?
Date Fri, 16 Nov 2012 21:05:49 GMT
Hi Floyd,

I don't think there is a general answer to that question.  You would have
to test it with your own corpus/index and your own queries.  If you have
those and can maintain 2 indices, one using BM25 and the other using VSM or
anything else you want to compare, you would want to do some A/B testing
and compare various metrics that indicate which search is better.  Have a
look at the picture on
http://blog.sematext.com/2012/01/06/relevance-tuning-and-competitive-advantage-via-search-analytics/
to see what I mean.
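
As a rough illustration of that kind of comparison, here is a minimal Python sketch. It assumes you have already collected, for the same set of queries, the ranked document ids returned by each index plus a set of ids judged relevant (from clicks or manual judgments). All names and data below are hypothetical, and precision@k and MRR are just two common choices of metric, not the only ones worth tracking.

```python
from typing import Callable, Dict, List, Set

Runs = Dict[str, List[str]]       # query id -> ranked doc ids from one index
Judgments = Dict[str, Set[str]]   # query id -> doc ids judged relevant


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was returned."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(runs: Runs, judgments: Judgments,
                      metric: Callable[[List[str], Set[str]], float]) -> float:
    """Average a per-query metric over all judged queries."""
    scores = [metric(runs.get(q, []), rel) for q, rel in judgments.items()]
    return sum(scores) / len(scores)


def compare(runs_a: Runs, runs_b: Runs, judgments: Judgments, k: int = 2):
    """Return {metric name: (score for A, score for B)}."""
    p_at_k = lambda ranked, rel: precision_at_k(ranked, rel, k)
    return {
        f"P@{k}": (mean_over_queries(runs_a, judgments, p_at_k),
                   mean_over_queries(runs_b, judgments, p_at_k)),
        "MRR": (mean_over_queries(runs_a, judgments, reciprocal_rank),
                mean_over_queries(runs_b, judgments, reciprocal_rank)),
    }


# Hypothetical toy data: two queries, results from a BM25 index and a VSM index.
judgments = {"q1": {"d1", "d3"}, "q2": {"d7"}}
bm25_runs = {"q1": ["d1", "d3", "d9"], "q2": ["d7", "d2", "d4"]}
vsm_runs = {"q1": ["d9", "d1", "d3"], "q2": ["d2", "d7", "d4"]}

results = compare(bm25_runs, vsm_runs, judgments, k=2)
for name, (a, b) in results.items():
    print(f"{name}: BM25={a:.2f} VSM={b:.2f}")
```

On real traffic you would want many more queries than this and some check that the difference is not just noise before drawing conclusions.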

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html




On Fri, Nov 16, 2012 at 12:28 AM, Floyd Wu <floyd.wu@gmail.com> wrote:

> Thanks everyone, especially Tom, for the detailed explanation of this
> topic.
> Of course, in academic work we need to interpret results carefully. What I
> care about is the end user's point of view: will using BM25 result in
> better ranking than Lucene's original VSM+Boolean model? How significant
> will the difference be?
> I'd like to hear about experiences from the community.
>
> Floyd
>
>
> 2012/11/16 Tom Burton-West <tburtonw@umich.edu>
>
> > Hello Floyd,
> >
> > There is a ton of research literature out there comparing BM25 to vector
> > space.  But you have to be careful interpreting it.
> >
> > BM25 originally beat the SMART vector space model in the early TRECs
> > because it did better tf and length normalization.  Pivoted Document
> > Length normalization was invented to get the vector space model to catch
> > up to BM25.  (Just Google for Singhal length normalization.  Amit Singhal,
> > now chief of Google Search, did his doctoral thesis on this and it is
> > available.  Similarly, Stephen Robertson, now at Microsoft Research,
> > published a ton of studies of BM25.)
> >
> > The default Solr/Lucene similarity class doesn't provide the length or tf
> > normalization tuning params that BM25 does.  There is the
> > SweetSpotSimilarity, but that doesn't quite work the same way that the
> > BM25 normalizations do.
> >
> > Document length normalization needs and parameter tuning all depend on
> > your data.  So if you are reading a comparison, you need to determine:
> > 1) When comparing recall/precision etc. between vector space and BM25,
> > did the experimenter tune both the vector space and the BM25 parameters?
> > 2) Are the documents (and queries) they are using in the test similar in
> > length characteristics to your documents and queries?
> >
> > We are planning to do some testing in the next few months for our use
> > case, which is 10 million books where we index the entire book.  These
> > are extremely long documents compared to most IR research.
> > I'd love to hear about actual (non-research) production implementations
> > that have tested the new ranking models available in Solr.
> >
> > Tom
> >
> >
> >
> > On Wed, Nov 14, 2012 at 9:16 PM, Floyd Wu <floyd.wu@gmail.com> wrote:
> >
> > > Hi there,
> > > Can anybody kindly tell me how to set up Solr to use BM25?
> > > By the way, are there any experiments or research comparing BM25 and
> > > the classical VSM model in terms of recall/precision?
> > >
> > > Thanks in advance.
> > >
> >
>
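
For readers following the thread: the tf and length normalization tuning that Tom mentions is usually expressed through two BM25 parameters, conventionally named k1 (tf saturation) and b (strength of length normalization). The following is a minimal Python sketch of the textbook per-term BM25 score, assuming a smoothed idf of the form log(1 + (N - n + 0.5) / (n + 0.5)). It is an illustration only, not Lucene's actual code path, so it will not reproduce Solr's scores exactly. The function name and the sample numbers are hypothetical.

```python
import math


def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    num_docs: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook BM25 contribution of one query term to one document.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document (in terms)
    avg_doc_len -- average document length in the collection
    num_docs    -- number of documents in the collection (N)
    doc_freq    -- number of documents containing the term (n)
    k1          -- tf saturation: higher values let repeated terms count more
    b           -- length normalization: 0 disables it, 1 applies it fully
    """
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


# Same term frequency in an average-length document and a very long one.
short_doc = bm25_term_score(tf=3, doc_len=100, avg_doc_len=100,
                            num_docs=1000, doc_freq=50)
long_doc = bm25_term_score(tf=3, doc_len=1000, avg_doc_len=100,
                           num_docs=1000, doc_freq=50)
print(short_doc > long_doc)  # the long document is penalized

# With b=0 document length no longer affects the score.
no_norm_short = bm25_term_score(3, 100, 100, 1000, 50, b=0.0)
no_norm_long = bm25_term_score(3, 1000, 100, 1000, 50, b=0.0)
print(no_norm_short == no_norm_long)
```

For collections of very long documents, such as the whole-book index Tom describes, b is the parameter most likely to need tuning away from its common default.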
