Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: 
 <CAH=GueTUXWD8jqVeyd6yoRRh8yu69CaWxkC6Q_HfFqEjapgA7A@mail.gmail.com>
References: 
 <CAH=GueTUXWD8jqVeyd6yoRRh8yu69CaWxkC6Q_HfFqEjapgA7A@mail.gmail.com>
Date: Thu, 15 Nov 2012 14:04:12 -0500
Message-ID: 
 <CAMySt+GWOn80USP_wR05Yiz-xF590tOB70dk1fO7mKUNc9OgTg@mail.gmail.com>
Subject: Re: BM25 model for solr 4?
From: Tom Burton-West <tburtonw@umich.edu>
To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=0016e6d285b419651b04ce8d5120

--0016e6d285b419651b04ce8d5120
Content-Type: text/plain; charset=ISO-8859-1

Hello Floyd,

There is a ton of research literature out there comparing BM25 to vector
space.  But you have to be careful interpreting it.

BM25 originally beat the SMART vector space model in the early TRECs
because it did better tf and length normalization.  Pivoted document
length normalization was invented to get the vector space model to catch up
to BM25.  (Just Google for Singhal length normalization.  Amit Singhal,
now chief of Google Search, did his doctoral thesis on this and it is
available.  Similarly, Stephen Robertson, now at Microsoft Research,
published a ton of studies of BM25.)

The default Solr/Lucene similarity class doesn't provide the length or tf
normalization tuning params that BM25 does.  There is SweetSpotSimilarity,
but that doesn't quite work the same way that the BM25 normalizations do.
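
To make the two tuning params concrete, here is a minimal sketch of the
textbook BM25 tf component (idf left out).  It is not Lucene's actual code,
which stores document lengths in a lossy encoded form; the function name is
made up for illustration, and k1=1.2, b=0.75 are just the commonly cited
defaults.

```python
def bm25_tf_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency component (idf omitted).

    k1 controls how quickly repeated occurrences of a term saturate.
    b controls how strongly the score is normalized by document length
    (b=0 disables length normalization, b=1 applies it fully).
    """
    # Length normalization: 1.0 for an average-length document,
    # greater than 1.0 for longer ones when b > 0.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturating tf: approaches (k1 + 1) as tf grows.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


if __name__ == "__main__":
    avg_len = 1000.0  # hypothetical average document length in tokens
    for doc_len in (100, 1000, 100000):
        for b in (0.0, 0.75, 1.0):
            w = bm25_tf_weight(tf=5, doc_len=doc_len, avg_doc_len=avg_len, b=b)
            print(f"doc_len={doc_len:>6} b={b:.2f} weight={w:.4f}")
```

Running it shows how much a very long document is penalized as b goes up,
which is exactly the kind of thing that has to be tuned against your own
data.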

Document length normalization needs and parameter tuning both depend on
your data.  So if you are reading a comparison, you need to determine:
1) When comparing recall/precision etc. between vector space and BM25, did
the experimenter tune both the vector space and the BM25 parameters?
2) Are the documents (and queries) used in the test similar in length
characteristics to your documents and queries?

We are planning to do some testing in the next few months for our use case,
which is 10 million books where we index the entire book.  These are
extremely long documents compared to those used in most IR research.
I'd love to hear about actual (non-research) production implementations
that have tested the new ranking models available in Solr.

Tom


On Wed, Nov 14, 2012 at 9:16 PM, Floyd Wu <floyd.wu@gmail.com> wrote:

> Hi there,
> Does anybody can kindly tell me how to setup solr to use BM25?
> By the way, are there any experiment or research shows BM25 and classical
> VSM model comparison in recall/precision rate?
>
> Thanks in advanced.
>

--0016e6d285b419651b04ce8d5120--