lucene-dev mailing list archives

From "David Mark Nemeskey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
Date Fri, 01 Apr 2011 08:47:05 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014472#comment-13014472 ]

David Mark Nemeskey commented on LUCENE-2959:
---------------------------------------------

Robert,

As for the problems with BM25F:

{quote}
    * for any field, Lucene has a per-field terms dictionary that contains that term's docFreq.
To compute BM25f's IDF method would be challenging, because it wants a docFreq "across all
the fields".
    * the same issue applies to length normalization, lucene has a "field length" but really
no concept of document length.
{quote}

One thing that is not clear to me is why these limitations would not be a problem for BM25.
As I see it, the difference between the two methods is that BM25 simply computes tfs, idfs
and document length from the whole document -- which, according to what you said, is not available
in Lucene. That's why I figured that a variant of BM25F would actually be more straightforward
to implement.

{quote}
(its not clear to me at a glance either from the original paper, if this should be across
only the fields in the query, across all the fields in the document, and if a "static" schema
is implied in this scoring system (in lucene document 1 can have 3 fields and document 2 can
have 40 different ones, even with different properties).
{quote}

Actually I am not sure there is a consensus on what BM25F actually is. :) For example, the
BM25 formula can be applied to the weighted sum of field tfs, or alternatively, the per-field
BM25 scores can be summed after normalization. I've seen both called (maybe incorrectly)
BM25F.
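
To make the two readings concrete, here is a minimal sketch of both. It is not Lucene code: the class name, the method names, the parameter values k1 = 1.2 and b = 0.75, and the per-field weights are all assumptions made for illustration, and the weights in the second method merely stand in for whatever normalization is applied to the per-field scores. The formulas follow a common textbook form of BM25.

```java
/** Illustrative only: two scoring schemes that both get called "BM25F". */
final class Bm25fVariants {
    static final double K1 = 1.2;  // assumed saturation parameter
    static final double B = 0.75;  // assumed length-normalization parameter

    /** Length-normalized term frequency: tf / (1 - b + b * len / avgLen). */
    static double normalizedTf(double tf, double len, double avgLen) {
        return tf / (1.0 - B + B * len / avgLen);
    }

    /** BM25 saturation of an already normalized tf: x * (k1 + 1) / (x + k1). */
    static double saturate(double x) {
        return x * (K1 + 1.0) / (x + K1);
    }

    /** Reading 1: weighted sum of the field tfs first, then one BM25 saturation. */
    static double combineThenSaturate(double idf, double[] weights,
                                      double[] tfs, double[] lens, double[] avgLens) {
        double pseudoTf = 0.0;
        for (int f = 0; f < tfs.length; f++) {
            pseudoTf += weights[f] * normalizedTf(tfs[f], lens[f], avgLens[f]);
        }
        // Needs a single idf for the term, i.e. a docFreq "across all the fields".
        return idf * saturate(pseudoTf);
    }

    /** Reading 2: a full BM25 score per field, then a weighted sum of the scores. */
    static double saturateThenCombine(double[] fieldIdfs, double[] weights,
                                      double[] tfs, double[] lens, double[] avgLens) {
        double score = 0.0;
        for (int f = 0; f < tfs.length; f++) {
            // Uses only per-field statistics: the field's own idf and length.
            score += weights[f] * fieldIdfs[f]
                    * saturate(normalizedTf(tfs[f], lens[f], avgLens[f]));
        }
        return score;
    }
}
```

Because the saturation is non-linear, the two methods generally give different scores for the same inputs. Only the first one needs a cross-field idf; the second gets by with the per-field statistics mentioned in the quote above.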

If I understand correctly, the current scoring algorithm takes into account only the fields
explicitly specified in the query. Is that right? If so, I see no reason why BM25 should behave
otherwise. Which of course also means that we probably won't be able to store the summed
doc length and idf.
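
A small sketch of why the summed length depends on the query. The class and method names are hypothetical, and the per-field lengths are passed in as a plain map rather than read from any Lucene API.

```java
import java.util.Map;
import java.util.Set;

/** Illustrative only: a "document length" restricted to the fields in the query. */
final class QueryScopedLength {
    /** Sums the lengths of only those fields that the query refers to. */
    static long docLength(Map<String, Integer> fieldLengths, Set<String> queryFields) {
        long total = 0;
        for (String field : queryFields) {
            // A document may not contain every queried field; count it as length 0.
            total += fieldLengths.getOrDefault(field, 0);
        }
        return total;
    }
}
```

The same document gets a different length for a query on title and body than for a query on body alone, so no single summed value can be stored ahead of time.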

Robert, would you be so kind as to have a look at my proposal? It can be found at http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1.
It's basically the same as what I sent to the mailing list. I wrote that I want to implement
BM25, BM25F and DFR ("the framework", I meant with one or two smoothing models), as well as
to convert the original scoring to the new framework. In light of the thread here, I guess
it would be better to modify these goals, perhaps by:
* deleting the conversion part?
* committing myself to BM25/BM25F only?
* explicitly stating that I want a higher level API based on the low-level one?

As for the last item, it applies only if I continue / join the work in 2392. Since I guess nobody
wants two ranking frameworks, of course I will, but should I then concentrate just on the
higher level API in this part of the proposal?

Thanks!

> [GSoC] Implementing State of the Art Ranking for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-2959
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2959
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Examples, Javadocs, Query/Scoring
>            Reporter: David Mark Nemeskey
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2959_mockdfr.patch, implementation_plan.pdf, proposal.pdf
>
>
> Lucene employs the Vector Space Model (VSM) to rank documents, which compares
> unfavorably to state of the art algorithms, such as BM25. Moreover, the architecture is
> tailored specifically to VSM, which makes the addition of new ranking functions a
> non-trivial task.
> This project aims to bring state of the art ranking methods to Lucene and to implement a
> query architecture with pluggable ranking functions.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

