lucene-dev mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Integrating Language Models into Lucene
Date Thu, 26 Feb 2009 08:25:08 GMT
On Thursday 26 February 2009 02:21:41 Koren Krupko wrote:
> 
> Hello Lucene Developers!
> 
> My name is Koren Krupko. I'm quite new to Lucene but I do have experience in
> research in the fields of information retrieval. After reviewing Lucene's
> capabilities I understand that one of its major strengths is its scalability
> (as opposed to other frameworks such as Lemur). However, the retrieval and
> scoring models used by Lucene are based upon the rather obsolete traditional
> Vector Space Model. I'm interested in adding newer, state of the art,
> retrieval models based on the notion of Language Models (see
> http://www.nabble.com/file/p22215790/LM-review.pdf for more
> details).
> In recent years, retrieval systems based on LM have consistently outperformed their
> VSM-based counterparts in well-recognized competitions such as
> TREC. Thus, in order to make Lucene more attractive to IR researchers, I
> would like to implement the following LM scoring functions using both
> Jelinek-Mercer and Dirichlet priors smoothing functions: Query Likelihood,
> KL-Divergence and Cross Entropy.
> Integrating Language Models into Lucene, in addition to its proven
> performance capabilities and ease of use, will undoubtedly advance Lucene
> into becoming the leading open source IR framework.
> 
> Assuming the usage of an Inverted Index holding posting lists, in order to
> implement basic LM scoring functions, I need the following information
> available during query time:
> 1. For each term in the inverted index:
>    a. Frequency in every document.
>    b. Frequency in the corpus.
> 2. For each document: its size.
> 3. Total size of the corpus.
> As I understand, 1a is implemented in Lucene but the problem is getting 1b,
> 2 and 3 since these details are not calculated during indexing. As I see it,
> one could use the Payload to store document size.

The field length is encoded in the norms.
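
For illustration, the statistics listed above (1a, 1b, 2 and 3) are enough to
compute a smoothed query likelihood. The following is a minimal Python sketch,
not Lucene code: the function names, the parameter defaults and the skipping of
terms unseen in the corpus are all assumptions made for this example, and the
Jelinek-Mercer convention used here puts the weight lam on the corpus model.

```python
import math

def jelinek_mercer(tf, doc_len, cf, corpus_len, lam=0.5):
    """P(t|d) = (1 - lam) * tf/doc_len + lam * cf/corpus_len."""
    p_doc = tf / doc_len if doc_len > 0 else 0.0
    p_corpus = cf / corpus_len
    return (1.0 - lam) * p_doc + lam * p_corpus

def dirichlet(tf, doc_len, cf, corpus_len, mu=2000.0):
    """P(t|d) = (tf + mu * cf/corpus_len) / (doc_len + mu)."""
    return (tf + mu * cf / corpus_len) / (doc_len + mu)

def query_log_likelihood(query_terms, doc_tf, doc_len,
                         corpus_cf, corpus_len, smooth=dirichlet):
    """Sum of log P(t|d) over the query terms for one document.

    doc_tf    -- term -> frequency in this document (1a)
    corpus_cf -- term -> frequency in the whole corpus (1b)
    doc_len   -- number of terms in this document (2)
    corpus_len-- number of terms in the corpus (3)
    """
    score = 0.0
    for term in query_terms:
        cf = corpus_cf.get(term, 0)
        if cf == 0:
            continue  # term unseen in the corpus: skipped in this sketch
        p = smooth(doc_tf.get(term, 0), doc_len, cf, corpus_len)
        score += math.log(p)
    return score
```

Both smoothing functions need the corpus frequency and the corpus size in
addition to the per-document values, which is why 1b and 3 have to be available
at query time.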

> However, adding the Corpus
> statistics requires fundamental changes in the index file format. From first
> glance, this addition isn't substantial space-wise since all we need is one
> more parameter per term. My eventual goal is to build a more complete and
> comprehensive index once, one that will allow running multiple retrieval
> sessions with different scoring models later.
> I did a survey of the forum but didn't find anything similar to my ideas
> (the closest I got was https://issues.apache.org/jira/browse/LUCENE-965). I
> also understand that there are thoughts regarding changing the index format
> in the future ("flexible indexing" -
> https://issues.apache.org/jira/browse/LUCENE-1458).
> 
> My questions are:
> 1. Has anyone tried to do something similar in the past?

This is a term scorer that simply divides term frequency by field length:
https://issues.apache.org/jira/browse/LUCENE-293
A better field length encoding would be welcome, but it's a start.
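
As a rough sketch of what such a scorer computes: assuming the norm holds
approximately 1/sqrt(field length) with no boost applied (an assumption about
the default length normalization, not something stated in this thread), the
field length can be estimated back from the norm, and the maximum-likelihood
term probability follows. This is Python for illustration only, with
hypothetical function names; it is not the code attached to LUCENE-293.

```python
def approx_field_length(norm):
    """Estimate the field length from a norm assumed to be 1/sqrt(length)."""
    if norm <= 0.0:
        raise ValueError("norm must be positive")
    return 1.0 / (norm * norm)

def ml_term_score(tf, norm):
    """Term frequency divided by the estimated field length."""
    return tf / approx_field_length(norm)
```

Since the norm is stored with limited precision, the recovered length is only
an estimate, and the resulting score inherits that error.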

> 2. Is anyone working on something similar at the moment?

I was, but not any more, and for reasons unrelated to the merits of LM.

> 3. Do you think LM can/should become a part of official future Lucene
> versions?

A contrib module with an alternative set of scorers would be a nice goal,
for example starting from the one referenced above.

> 4. How would you recommend implementing the index additions with minimal
> changes as a temporary patch?

No need for a temporary patch, just create a separate issue for each index
addition, and see what happens.

Regards,
Paul Elschot
