lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
Date Thu, 31 Mar 2011 12:23:05 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013944#comment-13013944 ]

Robert Muir commented on LUCENE-2959:
-------------------------------------

{quote}
I think the main point would be to make the addition of a new ranking function as easy as
possible. At least a prototype implementation should be very straightforward, even at the
expense of performance. Then, if the new method provides good results, the developer can go
on to the lower level to squeeze more juice out of it. It's hard for me to discuss this
without knowing the code, of course, but do you think it is possible?
{quote}

This sounds great! For example, you could extend the low-level API, gather every possible
statistic that Lucene has, and present a high-level API that looks more like Terrier's scoring
API (which I'm guessing is what researchers would prefer?), where they basically implement
the scoring in one method with all the stats there.

Someone could then extend this high-level API for prototyping, which would make it easier to experiment.
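
A rough sketch of what such a high-level API might look like. None of these types exist in Lucene: PrototypeStats, PrototypeScorer and the toy formula are made up purely for illustration, and the set of statistics is just a guess at what a researcher would want.

```java
// Hypothetical sketch only: these types are NOT part of Lucene.

/** Every statistic a prototype ranking function might want, gathered in one place. */
final class PrototypeStats {
    final long numDocs;          // number of documents in the index
    final long docFreq;          // number of documents containing the term (in this field)
    final double termFreq;       // occurrences of the term in this document's field
    final double fieldLength;    // number of tokens in this document's field
    final double avgFieldLength; // average length of this field across the index

    PrototypeStats(long numDocs, long docFreq, double termFreq,
                   double fieldLength, double avgFieldLength) {
        this.numDocs = numDocs;
        this.docFreq = docFreq;
        this.termFreq = termFreq;
        this.fieldLength = fieldLength;
        this.avgFieldLength = avgFieldLength;
    }
}

/** The researcher implements the whole ranking function in this one method. */
interface PrototypeScorer {
    double score(PrototypeStats stats);
}

final class PrototypeScorerDemo {
    /** A toy TF-IDF variant, written in one line against the high-level API. */
    static final PrototypeScorer TOY_TFIDF = s ->
            Math.sqrt(s.termFreq) * Math.log(1.0 + (double) s.numDocs / s.docFreq);

    public static void main(String[] args) {
        PrototypeStats stats = new PrototypeStats(1000, 10, 4.0, 120.0, 100.0);
        System.out.println("toy score = " + TOY_TFIDF.score(stats));
    }
}
```

The point of the design is that the prototype never touches postings or norms directly; the (slower) glue that fills in PrototypeStats would live in the low-level layer.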

{quote}
I think I will follow your advice and concentrate on how to make BM25F fast.
{quote}

Actually, as far as BM25F goes, it presents a few challenges (some already discussed on LUCENE-2091).


To summarize:
* For any field, Lucene has a per-field terms dictionary that contains each term's docFreq.
Computing BM25F's IDF would be challenging, because it wants a docFreq "across all
the fields". (It's not clear to me at a glance from the original paper whether this should
be across only the fields in the query or across all the fields in the document, or whether
a "static" schema is implied in this scoring system. In Lucene, document 1 can have 3 fields
and document 2 can have 40 different ones, even with different properties.)
* The same issue applies to length normalization: Lucene has a "field length" but really no
concept of document length.

So I just wanted to mention that while it's possible here to apply a per-field TF boost before
the non-linear TF saturation, it's not immediately clear how to adapt the BM25F formula to
Lucene: how to combine these scores without using a (wasteful) "catch-all field" and some
lying behind the scenes to force this catch-all field's length normalization and docFreq to
be used.
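
To make the "boost before saturation" part concrete, here is a minimal sketch of one common simplified BM25F formulation, not anything that exists in Lucene. The class and method names are hypothetical. Note that the idf is simply passed in by the caller: where that cross-field value would come from is exactly the open question above.

```java
// Hypothetical sketch only: not Lucene code. The idf argument is assumed to be
// computed elsewhere, since Lucene has no cross-field docFreq to derive it from.
final class Bm25fSketch {

    /** Per-field inputs for one term in one document. */
    static final class FieldTf {
        final double tf;        // term frequency in this field
        final double length;    // length of this field in this document
        final double avgLength; // average length of this field across the index
        final double boost;     // per-field weight
        final double b;         // per-field length normalization strength, 0..1

        FieldTf(double tf, double length, double avgLength, double boost, double b) {
            this.tf = tf;
            this.length = length;
            this.avgLength = avgLength;
            this.boost = boost;
            this.b = b;
        }
    }

    /** Sum of boosted, length-normalized term frequencies: still linear at this point. */
    static double weightedTf(FieldTf[] fields) {
        double sum = 0.0;
        for (FieldTf f : fields) {
            double norm = 1.0 - f.b + f.b * (f.length / f.avgLength);
            sum += f.boost * f.tf / norm;
        }
        return sum;
    }

    /** The non-linear saturation is applied once, after the fields are combined. */
    static double score(FieldTf[] fields, double idf, double k1) {
        double tf = weightedTf(fields);
        return idf * tf / (k1 + tf);
    }

    public static void main(String[] args) {
        FieldTf[] fields = {
            new FieldTf(1.0, 10.0, 10.0, 3.0, 0.75),   // e.g. a boosted title field
            new FieldTf(2.0, 100.0, 100.0, 1.0, 0.75), // e.g. a body field
        };
        System.out.println("combined tf = " + weightedTf(fields));
        System.out.println("score       = " + score(fields, 2.0, 1.2));
    }
}
```

This differs from scoring each field separately and summing: there, every field would saturate on its own, so a term matching in several fields would be rewarded more than once.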

Too many questions arise for BM25F and how it would "fit" with Lucene, for example the fact
that "multiple fields" can really mean anything, and having a field in Lucene doesn't mean
at all that it was in your original document! For example, Solr users frequently use a "copyField"
to take the content of one field and duplicate it into a different field (perhaps applying some
processing). In terms of things like length normalization, it seems that a "document length"
calculated as the sum across the fields would be wrong for many use cases.
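
A tiny illustration of the copyField problem, with made-up field names and token counts: if a catch-all field is populated by copying the other fields, summing field lengths counts every token twice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: field names and token counts are invented.
final class CopyFieldLengthDemo {

    /** Naive "document length": the sum of all indexed field lengths. */
    static int sumFieldLengths(Map<String, Integer> fieldLengths) {
        int total = 0;
        for (int len : fieldLengths.values()) {
            total += len;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> indexed = new LinkedHashMap<>();
        indexed.put("title", 5);
        indexed.put("body", 100);
        // Hypothetical catch-all field filled by copying title and body.
        indexed.put("text_all", 105);

        System.out.println("tokens in the original document: 105");
        System.out.println("sum of indexed field lengths:    " + sumFieldLengths(indexed));
    }
}
```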

I only wanted to recommend against this one because of this rather serious challenge. It seems
like something we might want to table for the moment: Lucene is changing fast, and as new
capabilities arise, we might realize there is a more elegant way to address this... but at the
moment I think I would recommend starting with BM25.
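
For comparison, plain BM25 only needs statistics that are available per field (docFreq, field length, average field length). A minimal sketch of the textbook formula with one common IDF variant follows; the class name is hypothetical and this is not the eventual Lucene implementation.

```java
// Hypothetical sketch of the textbook per-field BM25 formula: not Lucene code.
final class Bm25Sketch {

    /** One common BM25 IDF variant; the +1 inside the log keeps it non-negative. */
    static double idf(long numDocs, long docFreq) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Saturating term-frequency component with field-length normalization. */
    static double tfNorm(double tf, double fieldLength, double avgFieldLength,
                         double k1, double b) {
        double norm = k1 * (1.0 - b + b * (fieldLength / avgFieldLength));
        return tf * (k1 + 1.0) / (tf + norm);
    }

    /** BM25 score of one term in one field of one document. */
    static double score(long numDocs, long docFreq, double tf,
                        double fieldLength, double avgFieldLength,
                        double k1, double b) {
        return idf(numDocs, docFreq) * tfNorm(tf, fieldLength, avgFieldLength, k1, b);
    }

    public static void main(String[] args) {
        // k1 = 1.2 and b = 0.75 are commonly used defaults.
        double average = score(1000, 10, 2.0, 100.0, 100.0, 1.2, 0.75);
        double longer = score(1000, 10, 2.0, 400.0, 100.0, 1.2, 0.75);
        System.out.println("average-length field: " + average);
        System.out.println("longer field:         " + longer);
    }
}
```

Everything here is computed from a single field's statistics, which is why it maps onto Lucene's per-field model without needing a catch-all field.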




> [GSoC] Implementing State of the Art Ranking for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-2959
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2959
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Examples, Javadocs, Query/Scoring
>            Reporter: David Mark Nemeskey
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2959_mockdfr.patch, implementation_plan.pdf, proposal.pdf
>
>
> Lucene employs the Vector Space Model (VSM) to rank documents, which compares
> unfavorably to state of the art algorithms, such as BM25. Moreover, the architecture is
> tailored specifically to VSM, which makes the addition of new ranking functions a
> non-trivial task.
> This project aims to bring state of the art ranking methods to Lucene and to implement a
> query architecture with pluggable ranking functions.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
