lucene-dev mailing list archives

From Earwin Burrfoot
Subject Re: Integrating Language Models into Lucene
Date Thu, 26 Feb 2009 01:50:06 GMT
Have you looked at MG4J (
Last time I did, it looked like the opposite of Lucene: nice,
up-to-date algorithmics, but hard to apply to complex real-world

On Thu, Feb 26, 2009 at 04:21, Koren Krupko <> wrote:
> Hello Lucene Developers!
> My name is Koren Krupko. I'm quite new to Lucene, but I do have research
> experience in the field of information retrieval. After reviewing Lucene's
> capabilities, I understand that one of its major strengths is its scalability
> (as opposed to other frameworks such as Lemur). However, the retrieval and
> scoring models used by Lucene are based on the rather obsolete traditional
> Vector Space Model. I'm interested in adding newer, state-of-the-art
> retrieval models based on the notion of Language Models (see
> LM-review.pdf  for more
> details).
> In recent years, LM-based retrieval systems have consistently outperformed
> their VSM-based counterparts in well-recognized competitions such as
> TREC. Thus, in order to make Lucene more attractive to IR researchers, I
> would like to implement the following LM scoring functions, using both
> Jelinek-Mercer and Dirichlet prior smoothing: Query Likelihood,
> KL-Divergence and Cross Entropy.
> Integrating Language Models into Lucene, in addition to its proven
> performance capabilities and ease of use, will undoubtedly advance Lucene
> toward becoming the leading open source IR framework.
> Assuming an inverted index holding posting lists, I need the following
> information available at query time in order to implement basic LM
> scoring functions:
> 1.      For each term in the inverted index –
> a.      Frequency in every document.
> b.      Frequency in the corpus.
> 2.      For each document – its size.
> 3.      Total size of the corpus.
> As I understand it, 1a is available in Lucene, but 1b, 2 and 3 are a
> problem, since these details are not calculated during indexing. As I see it,
> one could use the payload to store document size. However, adding the corpus
> statistics requires fundamental changes to the index file format. At first
> glance, this addition isn't substantial space-wise, since all we need is one
> more parameter per term. My eventual goal is to build a more complete and
> comprehensive index once, one that allows running multiple retrieval
> sessions with different scoring models later.
> I did a survey of the forum but didn't find anything similar to my ideas
> (the closest I got was I
> also understand that there are thoughts regarding changing the index format
> in the future ("flexible indexing" -
> My questions are:
> 1.      Has anyone tried to do something similar in the past?
> 2.      Is anyone working on something similar at the moment?
> 3.      Do you think LM can/should become a part of official future Lucene
> versions?
> 4.      How would you recommend implementing the index additions with minimal
> changes as a temporary patch?
> Koren
> --
> View this message in context:
> Sent from the Lucene - Java Developer mailing list archive at
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
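
For reference, a minimal sketch of how the statistics listed in the quoted
message (1a term frequency in a document, 1b term frequency in the corpus,
2 document size, 3 corpus size) enter query-likelihood scoring with
Jelinek-Mercer and Dirichlet smoothing. This is plain Java and does not use
any Lucene API; the class name, method names, and in-memory maps are
hypothetical and only stand in for whatever the index would provide. Skipping
query terms that never occur in the corpus is also an assumption of this
sketch, not something stated in the thread.

```java
import java.util.Map;

// Hypothetical helper, not part of Lucene: shows which statistics
// a smoothed language-model score needs at query time.
final class LmScoringSketch {

    // Jelinek-Mercer: p(t|d) = (1 - lambda) * tf/|d| + lambda * cf/|C|
    public static double jelinekMercer(long tf, long docLen, long cf,
                                       long corpusLen, double lambda) {
        double pDoc = docLen == 0 ? 0.0 : (double) tf / docLen;
        double pCorpus = (double) cf / corpusLen;
        return (1.0 - lambda) * pDoc + lambda * pCorpus;
    }

    // Dirichlet prior: p(t|d) = (tf + mu * cf/|C|) / (|d| + mu), with mu > 0
    public static double dirichlet(long tf, long docLen, long cf,
                                   long corpusLen, double mu) {
        double pCorpus = (double) cf / corpusLen;
        return (tf + mu * pCorpus) / (docLen + mu);
    }

    // Query likelihood: sum over query terms of log p(t|d),
    // here using Dirichlet smoothing.
    public static double queryLogLikelihood(String[] queryTerms,
                                            Map<String, Long> docTf,
                                            long docLen,
                                            Map<String, Long> corpusTf,
                                            long corpusLen,
                                            double mu) {
        double score = 0.0;
        for (String term : queryTerms) {
            long cf = corpusTf.getOrDefault(term, 0L);
            if (cf == 0) {
                continue; // assumption: ignore terms unseen in the corpus
            }
            long tf = docTf.getOrDefault(term, 0L);
            score += Math.log(dirichlet(tf, docLen, cf, corpusLen, mu));
        }
        return score;
    }
}
```

Every input here is one of the four quantities from the list: tf is 1a, cf
is 1b, docLen is 2, and corpusLen is 3. Only the smoothing parameters
(lambda, mu) are chosen at query time.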

Kirill Zakharenko/Кирилл Захаренко (
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
