lucene-dev mailing list archives

From Earwin Burrfoot <ear...@gmail.com>
Subject Re: Integrating Language Models into Lucene
Date Thu, 26 Feb 2009 01:50:06 GMT
Have you looked at MG4J (http://mg4j.dsi.unimi.it/)?
Last time I did, it looked like the opposite of Lucene: nice,
up-to-date algorithms, but hard to apply to complex real-world
tasks.

On Thu, Feb 26, 2009 at 04:21, Koren Krupko <krupkor@gmail.com> wrote:
>
> Hello Lucene Developers!
>
> My name is Koren Krupko. I'm quite new to Lucene but I do have research
> experience in the field of information retrieval. After reviewing Lucene's
> capabilities I understand that one of its major strengths is its scalability
> (as opposed to other frameworks such as Lemur). However, the retrieval and
> scoring models used by Lucene are based upon the rather obsolete traditional
> Vector Space Model. I'm interested in adding newer, state of the art,
> retrieval models based on the notion of Language Models (see
> http://www.nabble.com/file/p22215790/LM-review.pdf LM-review.pdf  for more
> details).
> In recent years, retrieval systems based on LM have consistently
> outperformed their VSM-based counterparts in well-recognized competitions
> such as TREC. Thus, in order to make Lucene more attractive to IR
> researchers, I would like to implement the following LM scoring functions
> using both Jelinek-Mercer and Dirichlet prior smoothing: Query Likelihood,
> KL-Divergence and Cross Entropy.
> Integrating Language Models into Lucene, in addition to its proven
> performance and ease of use, should help make Lucene the leading open
> source IR framework.
>
> Assuming the usage of an Inverted Index holding posting lists, in order to
> implement  basic LM scoring functions, I need the following information
> available during query time:
> 1. For each term in the inverted index:
>    a. Frequency in every document.
>    b. Frequency in the corpus.
> 2. For each document: its size.
> 3. Total size of the corpus.
> As I understand, 1a is implemented in Lucene but the problem is getting 1b,
> 2 and 3 since these details are not calculated during indexing. As I see it,
> one could use the Payload to store document size. However, adding the Corpus
> statistics requires fundamental changes in the index file format. At first
> glance, this addition isn't substantial space-wise since all we need is one
> more parameter per term. My eventual goal is to build a more complete
> index once, so that multiple retrieval sessions can later be run against it
> using different scoring models.
> I did a survey of the forum but didn't find anything similar to my ideas
> (the closest I got was https://issues.apache.org/jira/browse/LUCENE-965). I
> also understand that there are thoughts regarding changing the index format
> in the future ("flexible indexing" -
> https://issues.apache.org/jira/browse/LUCENE-1458).
>
> My questions are:
> 1. Has anyone tried to do something similar in the past?
> 2. Is anyone working on something similar at the moment?
> 3. Do you think LM can/should become a part of official future Lucene
>    versions?
> 4. How would you recommend implementing the index additions with minimal
>    changes as a temporary patch?
>
> Koren
>
> --
> View this message in context: http://www.nabble.com/Integrating-Language-Models-into-Lucene-tp22215790p22215790.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
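For readers following the quoted proposal, here is a rough sketch of how the four statistics it lists (term frequency per document, term frequency in the corpus, document size, corpus size) feed Query Likelihood scoring under Jelinek-Mercer and Dirichlet prior smoothing. This is not Lucene code and not the poster's implementation: `ToyIndex`, the function names, and the default values of `lam` and `mu` are all hypothetical choices made for illustration, and the formulas are the standard textbook forms of the two smoothing methods.

```python
import math
from collections import Counter, defaultdict


class ToyIndex:
    """Hypothetical in-memory inverted index that also keeps LM statistics."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: tf}   (item 1a)
        self.corpus_tf = Counter()         # term -> corpus frequency (item 1b)
        self.doc_len = {}                  # doc_id -> size in tokens (item 2)
        self.corpus_len = 0                # total tokens in corpus   (item 3)

    def add(self, doc_id, tokens):
        # Assumes non-empty documents; an empty one would break JM scoring.
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf
            self.corpus_tf[term] += tf
        self.doc_len[doc_id] = len(tokens)
        self.corpus_len += len(tokens)

    def tf(self, term, doc_id):
        return self.postings.get(term, {}).get(doc_id, 0)

    def p_corpus(self, term):
        # Maximum-likelihood estimate of the term under the corpus model.
        return self.corpus_tf[term] / self.corpus_len


def _sum_logs(probs):
    """Sum of log-probabilities; -inf if any term has zero probability."""
    total = 0.0
    for p in probs:
        if p <= 0.0:
            return float("-inf")
        total += math.log(p)
    return total


def ql_jelinek_mercer(index, query, doc_id, lam=0.5):
    # p(t|d) = (1 - lam) * tf/|d| + lam * p(t|C); lam=0.5 is illustrative.
    dl = index.doc_len[doc_id]
    return _sum_logs(
        (1 - lam) * index.tf(t, doc_id) / dl + lam * index.p_corpus(t)
        for t in query
    )


def ql_dirichlet(index, query, doc_id, mu=2000.0):
    # p(t|d) = (tf + mu * p(t|C)) / (|d| + mu); mu=2000 is illustrative.
    dl = index.doc_len[doc_id]
    return _sum_logs(
        (index.tf(t, doc_id) + mu * index.p_corpus(t)) / (dl + mu)
        for t in query
    )


# Usage: index two tiny documents and rank them for a one-term query.
index = ToyIndex()
index.add("d1", "lucene is a search library".split())
index.add("d2", "language models for retrieval".split())
query = ["lucene"]
ranked = sorted(index.doc_len,
                key=lambda d: ql_dirichlet(index, query, d),
                reverse=True)
```

Both scorers read only the four statistics from the list, which is why the proposal treats storing them at indexing time as the main obstacle. A query term that never occurs in the corpus scores negative infinity in this sketch; a real system would need a policy for that case.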



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
