lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From <joaquin.pe...@lsi.uned.es>
Subject Re: Summer of Code idea for lucene
Date Sun, 14 Sep 2008 09:54:19 GMT
Hi Mark, just trying to answer some of your questions.

Related with comparisons, I have prepared a small experiment using an EuroGov 2005 Collection
subset. Basically I have used the documents of the uk domain (around 60000) and 48 queries,
the result are as next

=Lucene=
MAP=> .4777	MRR=> .4852
=BM25=
MAP=> .5147	MRR=> .5227


As you can see there is an improvement of a 7% in terms of MAP and MRR, using standard parameters
for BM25, this is quite an improvement although the collection is not a big one. 
Indeed in the last years BM25 (and others models) have outperformed the more classical approaches
with VSM in competitions like TREC.

In terms of performance I have run 100 times the same 48 topic the results are as next
Lucene=> 18.7 secs
BM25=>   24.8 secs

BM25 is approximately 25% slower than Lucene, sure this could be optimized, however 4800 queries
in 25secs sounds good to me.

About the implementation I have tried to separate the boolean model from the ranking function
used, basically there are support for simple (not nested ones) boolean queries with AND, OR
and NOT operators. 

I'm not sure if this is correct from a strict theoretical point of view. BM25 is a probabilistic
ranking function which in some way means that just the OR operator should be supported.
However the others operators have been added in the same way as Lucene does mixing VSM with
boolean model, when different operators are used BM25BooleanScorer is used what is pretty
much slower than BM25SingleBooleanScorer, which calculates the scorer when just one boolean
operator type is used.

The simplified boolean model is implemented with MatchAllBooleanScorer, MustBooleanScorer,
ShouldBooleanScorer and NotBooleanScorer, sure this could be rewrited in order to looks more
similar to Lucene, but currently I'm not sure how to do it. Currently a termdoc for each term
is used in order to filter from the boolean query, and the result docs are ranked with BM25TermScorer.

The ranking function is as http://nlp.uned.es/~jperezi/Lucene-BM25/ and is implemented in
BM25TermScorer (for one term), I have tried to optimize it as much as I can, but sure this
could be improved, any ideas?

As you can see there are some stuff to work on it, but I hope that with some help this could
be improved, I believe that to add support for BM25  into Lucene framework will be a great
step forward for this, currently is quite common to find support for this ranking function
in a lot of search engines (Terrier, Sphinx, MG4J, Lemur, Xapian,...)

Thanks.

-----------------------------------------------------------
Joaquín Pérez Iglesias
Dpto. Lenguajes y Sistemas Informáticos
E.T.S.I. Informática (UNED)
Ciudad Universitaria
C/ Juan del Rosal nº 16
28040 Madrid - Spain
Phone. +34 91 398 87 25
Fax    +34 91 398 65 35
Office  2.07
Email: joaquin.perez@lsi.uned.es
-----------------------------------------------------------

Mark Miller <markrmiller@gmail.com> escribe :

> Cool, thanks.
> 
> Have you done any comparisons with the current scoring system? Can you 
> claim strong improvements? Have you looked at the performance impact at 
> all yet? That score method in termscorer looks particularly slow. Could 
> you explain a little how your bm25 implementation differs from the 
> current imp in a bit more depth? It looks like your booleanquery is 
> pretty simplified compared to the old.
> 
> joaquin.perez@lsi.uned.es
> wrote:
> > Hi Mark,
> >
> > thank you for your advice, I don't know too much about licenses.
> > I have just changed the license to Apache-2.0, hope that this will be ok,
> and make things easier.
> >
> > If you need any help or have some comments about the implementation,
> please let me know. I would be really happy if this implementation is finally
> integrated into Lucene.
> >
> > Joaquín.
> >
> > -----------------------------------------------------------
> > Joaquín Pérez Iglesias
> > Dpto. Lenguajes y Sistemas Informáticos
> > E.T.S.I. Informática (UNED)
> > Ciudad Universitaria
> > C/ Juan del Rosal nº 16
> > 28040 Madrid - Spain
> > Phone. +34 91 398 87 25
> > Fax +34 91 398 65 35
> > Office 2.07
> > Email: joaquin.perez@lsi.uned.es
> > -----------------------------------------------------------
> >
> > Mark Miller <markrmiller@gmail.com>
> escribe :
> >
> >   
> >> Hey Joaquin,
> >>
> >> Your work here looks very interesting. The Lucene community has shown
> a 
> >> strong interest in this area before (see LUCENE-965).
> >>
> >> I see you went with an lgpl license though. This might be a bit of a
> 
> >> barrier in getting feedback from a community based on apache license
> 
> >> software. Obviously, there still might be interest,learning, and an
> 
> >> exchange of ideas, but none of your code can be distributed with
> Lucene, 
> >> and so what you have done loses some of its appeal in that sense. Is
> 
> >> there any chance you would be willing to relax the license, possibly
> 
> >> gaining more feedback, contributors, and possible inclusion in Lucene?
> 
> >> Certainly not necessary to receive feedback, but I think it would help
> 
> >> -- I'd certainly be looking closer anyway.
> >>
> >> - Mark
> >>
> >> Joaquin Perez Iglesias wrote:
> >>     
> >>> Hi all,
> >>>
> >>> finally I got some time to finish the BM25/BM25F implementation
> for 
> >>> Lucene you can find more details at 
> >>> http://nlp.uned.es/~jperezi/Lucene-BM25/,
> >>>       
> >> it has been tested but I 
> >>     
> >>> cannot assure that is bugs free.
> >>> It would be great to receive some feedback about it.
> >>>
> >>> There are some details about the implementation that I consider
> will 
> >>> be of interest,as how to calculate the average_length or  idf at
> 
> >>> document level.
> >>> Please if you find any bug or mistake in the supplied
> implementation 
> >>> let me know and I will try to solve it, same for questions.
> >>>
> >>> Hope that some of you will find useful.
> >>>
> >>> Thanks in advance.
> >>>
> >>>
> >>>
> >>> joaquin.perez@lsi.uned.es
> >>>       
> >> escribió:
> >>     
> >>>> Hi Otis,
> >>>>
> >>>> as my colleague said, we have a first implementation of BM25
> over 
> >>>> Lucene, this development is part of a (almost finished) thesis
> 
> >>>> project that compares different IR models, over an standard
> 
> >>>> collection. At the same time we are trying to extend this
> first 
> >>>> implementation in order to support BM25F for multifield
> queries, 
> >>>> unfortunately at this time we are too busy to prepare a final
> version
> >>>>         
> >>>> of this code, so we will have to finish this code over the
> summer 
> >>>> (hopefully we will have more time :-))), and make it public at
> this
> >>>>         
> >>>> time.
> >>>>
> >>>> We will inform to this list when we will finish the
> preparation of a
> >>>>         
> >>>> final version.
> >>>>
> >>>> Thanks to everybody for the interest!!!
> >>>>
> >>>> Bye
> >>>> Joaquin
> >>>>
> >>>> -----------------------------------------------------------
> >>>> Joaquín Pérez Iglesias
> >>>> Dpto. Lenguajes y Sistemas Informáticos
> >>>> E.T.S.I. Informática (UNED)
> >>>> Ciudad Universitaria
> >>>> C/ Juan del Rosal nº 16
> >>>> 28040 Madrid - Spain
> >>>> Phone. +34 91 398 87 25
> >>>> Fax    +34 91 398 65 35
> >>>> Office  2.07
> >>>> Email: joaquin.perez@lsi.uned.es
> >>>> -----------------------------------------------------------
> Otis 
> >>>> Gospodnetic <otis_gospodnetic@yahoo.com>
> >>>>         
> >> escribe :
> >>     
> >>>>  
> >>>>         
> >>>>> Hi Jose,
> >>>>>
> >>>>> I was wondering if you ever got to this.  I would love to
> see and
> >>>>>           
> >>>>> try BM25 for
> >>>>> Lucene!
> >>>>>
> >>>>>
> >>>>> I'm looking at http://code.google.com/soc/2008/asf/about.html
> >>>>> and it looks like this didn't make it into GSoC, but this
> would
> >>>>>           
> >>>>> still be great
> >>>>> to have.
> >>>>>
> >>>>> Thanks,
> >>>>> Otis
> >>>>> -- 
> >>>>> Sematext -- http://sematext.com/ --
> >>>>> Lucene - Solr - Nutch
> >>>>>
> >>>>>
> >>>>> ----- Original Message ----
> >>>>>    
> >>>>>           
> >>>>>> From: José Ramón Pérez Agüera <jose.aguera@gmail.com>
> >>>>>> To: java-dev@lucene.apache.org;
> >>>>>>       
> >>>>>>             
> >>>>> Joaquin Perez-Iglesias <joaquin.perez.iglesias@gmail.com>
> >>>>>    
> >>>>>           
> >>>>>> Sent: Saturday, March 15, 2008 4:54:08 AM
> >>>>>> Subject: Re: Summer of Code idea for lucene
> >>>>>>
> >>>>>> we have almost implemented BM25 using lucene
> structure, but we
> >>>>>>             
> >> need
> >>     
> >>>>>> help to finish query parser and other details. If you
> o
> >>>>>>             
> >> somebody want
> >>     
> >>>>>> We can send you the code and you can help us to
> implement the
> >>>>>>             
> >> query
> >>     
> >>>>>> parser and prepare the code to sandbox.
> >>>>>>
> >>>>>> If there are people interested I can made a web page
> for the
> >>>>>>             
> >> project
> >>     
> >>>>>> and put our implementatio to download
> >>>>>>
> >>>>>> Somebody is interested?
> >>>>>>
> >>>>>> jose
> >>>>>>
> >>>>>> -- 
> >>>>>> José Ramón Pérez Agüera
> >>>>>>
> >>>>>> Dept. de Ingeniería del Software e Inteligencia
> Artificial
> >>>>>> Despacho 411 tlf. 913947599
> >>>>>> Facultad de Informática
> >>>>>> Universidad Complutense de Madrid
> >>>>>>
> >>>>>> On Sat, Mar 15, 2008 at 5:32 AM, Ian Holsman wrote:
> >>>>>>      
> >>>>>>             
> >>>>>>> If no one objects (I don't think it's too late)
> >>>>>>>
> >>>>>>>  would you mind a GSOC project to implement
> BM25
> >>>>>>>         
> >>>>>>>               
> >>>>> relevancy/scoring?
> >>>>>     
> >>>>>
> >>>>>           
> >>
> ---------------------------------------------------------------------
> >>     
> >>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>>>     
> >>>>>           
> >>>> ________________________________________________
> >>>> Servicio WebMail de CiberUNED http://www.uned.es
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>         
> >>
> ---------------------------------------------------------------------
> >>     
> >>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>>
> >>>>
> >>>>   
> >>>>         
> >>
> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>     
> >
> > ________________________________________________
> > Servicio WebMail de CiberUNED http://www.uned.es
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >   
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org

________________________________________________
Servicio WebMail de CiberUNED http://www.uned.es



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message