lucene-dev mailing list archives

From Grant Ingersoll
Subject Re: Integrating Language Models into Lucene
Date Thu, 26 Feb 2009 12:41:30 GMT
I think there is a group in the Netherlands that has open sourced a  
version of Lucene using Language Models.

I'd certainly welcome alternate implementations.  There have been
many, many discussions about "flexible indexing" on the list here (and
I know there are a bunch of related JIRA issues too) that you might
look at.  In fact, several people have made some
progress towards it, such that we are getting close to being able to  
more easily plug in different scoring models.   With flex. indexing,  
you should be able to do #3 below, and I believe all the others are  
already possible.

On Feb 25, 2009, at 8:21 PM, Koren Krupko wrote:

> Hello Lucene Developers!
> My name is Koren Krupko. I'm quite new to Lucene, but I do have research
> experience in the field of information retrieval. After reviewing Lucene's
> capabilities, I understand that one of its major strengths is its
> scalability (as opposed to other frameworks such as Lemur). However, the
> retrieval and scoring models used by Lucene are based on the rather
> obsolete traditional Vector Space Model. I'm interested in adding newer,
> state-of-the-art retrieval models based on the notion of Language Models
> (see LM-review.pdf for more details).
> In recent years, retrieval systems based on LM have consistently
> outperformed their VSM-based counterparts in well-recognized competitions
> such as TREC. Thus, in order to make Lucene more attractive to IR
> researchers, I would like to implement the following LM scoring functions,
> each with both Jelinek-Mercer and Dirichlet prior smoothing: Query
> Likelihood, KL-Divergence and Cross Entropy.
> Integrating Language Models into Lucene, in addition to its proven
> performance and ease of use, will undoubtedly advance Lucene toward
> becoming the leading open source IR framework.
> Assuming the use of an inverted index holding posting lists, I need the
> following information available at query time in order to implement basic
> LM scoring functions:
> 1.	For each term in the inverted index –
> a.	Frequency in every document.
> b.	Frequency in the corpus.
> 2.	For each document – its size.
> 3.	Total size of the corpus.
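
As a rough illustration of how the four statistics listed above feed a
Language Model scorer, here is a minimal Python sketch of query-likelihood
scoring with Jelinek-Mercer and Dirichlet smoothing. It is not Lucene code:
the function name query_likelihood, its parameters, the toy data, and the
values chosen for mu and lam are all assumptions made for this example.

```python
import math

def query_likelihood(query_terms, doc_tf, doc_len, corpus_tf, corpus_len,
                     smoothing="dirichlet", mu=2000.0, lam=0.5):
    """Log query likelihood of one document under a smoothed unigram model.

    doc_tf     -- item 1a: term -> frequency in this document
    corpus_tf  -- item 1b: term -> frequency in the whole corpus
    doc_len    -- item 2:  number of tokens in this document
    corpus_len -- item 3:  number of tokens in the whole corpus
    """
    score = 0.0
    for term in query_terms:
        p_corpus = corpus_tf.get(term, 0) / corpus_len
        if p_corpus == 0.0:
            # Term never seen in the corpus: skipped here (a choice made
            # for this sketch, not a rule taken from the thread).
            continue
        tf = doc_tf.get(term, 0)
        if smoothing == "dirichlet":
            # Dirichlet prior: (tf + mu * P(w|C)) / (|d| + mu)
            p = (tf + mu * p_corpus) / (doc_len + mu)
        elif smoothing == "jm":
            # Jelinek-Mercer: (1 - lam) * P_ml(w|d) + lam * P(w|C)
            p_doc = tf / doc_len if doc_len else 0.0
            p = (1.0 - lam) * p_doc + lam * p_corpus
        else:
            raise ValueError("unknown smoothing: " + smoothing)
        score += math.log(p)
    return score

# Toy two-document corpus (hypothetical data), 8 tokens in total.
corpus_tf = {"language": 2, "models": 3, "vector": 1, "space": 1, "retrieval": 1}
corpus_len = 8
d1_tf, d1_len = {"language": 2, "models": 2}, 4
d2_tf, d2_len = {"vector": 1, "space": 1, "models": 1, "retrieval": 1}, 4

s1 = query_likelihood(["language"], d1_tf, d1_len, corpus_tf, corpus_len, mu=4.0)
s2 = query_likelihood(["language"], d2_tf, d2_len, corpus_tf, corpus_len, mu=4.0)
s1_jm = query_likelihood(["language"], d1_tf, d1_len, corpus_tf, corpus_len,
                         smoothing="jm", lam=0.5)
```

Note that only 1a and 2 are per-document values. 1b and 3 are corpus-wide,
so they can be read once per query term rather than once per posting.
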
> As I understand it, 1a is implemented in Lucene, but the problem is getting
> 1b, 2 and 3, since these details are not calculated during indexing. As I
> see it, one could use the Payload to store document size. However, adding
> the corpus statistics requires fundamental changes to the index file
> format. At first glance, this addition isn't substantial space-wise, since
> all we need is one more parameter per term. My eventual goal is to build a
> more complete and comprehensive index once, which will then allow running
> multiple retrieval sessions using different scoring models later.
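
To make the "build the index once, score many ways later" idea concrete,
here is a small Python sketch of an in-memory inverted index that records
all four statistics in a single indexing pass. It says nothing about
Lucene's actual file format or Payload API; the class name ToyIndex, its
fields, and the whitespace tokenizer are hypothetical and chosen only for
illustration.

```python
from collections import defaultdict

class ToyIndex:
    """In-memory inverted index that keeps LM statistics alongside postings."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: tf}      (1a)
        self.corpus_tf = defaultdict(int)  # term -> corpus frequency  (1b)
        self.doc_len = {}                  # doc_id -> token count     (2)
        self.corpus_len = 0                # total token count         (3)

    def add(self, doc_id, text):
        # Assumes each doc_id is added exactly once; tokenization is a
        # plain lowercase whitespace split, for simplicity.
        tokens = text.lower().split()
        self.doc_len[doc_id] = len(tokens)
        self.corpus_len += len(tokens)
        for tok in tokens:
            self.postings[tok][doc_id] = self.postings[tok].get(doc_id, 0) + 1
            self.corpus_tf[tok] += 1  # the one extra number stored per term

index = ToyIndex()
index.add("d1", "language models for retrieval")
index.add("d2", "vector space models")
```

In this sketch the corpus frequency costs one integer per term and the
document length one integer per document, which matches the space estimate
given above. Any scoring function can then read the same structures at
query time without re-indexing.
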
> I did a survey of the forum but didn't find anything closely similar to my
> ideas. I also understand that there are thoughts regarding changing the
> index format in the future ("flexible indexing").
> My questions are:
> 1.	Has anyone tried to do something similar in the past?
> 2.	Is anyone working on something similar at the moment?
> 3.	Do you think LM can/should become a part of future official Lucene versions?
> 4.	How would you recommend implementing the index additions with minimal changes as a temporary patch?
> Koren

Grant Ingersoll
