lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
Date Wed, 30 Mar 2011 14:54:05 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012996#comment-13012996
] 

Robert Muir commented on LUCENE-2959:
-------------------------------------

Hi David, to try to help get things moving, I created a branch: https://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring

This is the in-progress work from LUCENE-2392, which separates the scoring calculates from
the postings-list matching. In short, Similarity becomes very low-level, but the idea is that
you extend it to present a higher-level API (for example TFIDFSimilarity: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/java/org/apache/lucene/search/TFIDFSimilarity.java)
that is user-friendly and allows users to adjust parameters in a way that makes sense to that
scoring system.

As a start I implemented some very very rough/basic models in src/test:
BM25: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockBM25Similarity.java

Dirichlet LM: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockLMSimilarity.java

But these are in no way correct or extensible or nice. For example, the BM25 similarity is
slow, because as implemented its "average document length" is "live" (e.g. if you add more
segments its immediately adjusted for each query)... there is no caching at all. 

For example in this case, to speed up BM25, it could be nice for the Similarity to pull this
up-front, and create cached calculations. If a user wants to refresh their bm25 stats then
they could do something in SimilarityProvider to call it to recalculate the caches.

However, for a user that wants "super-realtime" view, it might be better for it to stay the
way it is now, or alternatively for the Sim to up-front do the 256 calculations per query
(ideally in weight, not per-segment in docscorer) to tableize the length normalizations.

So these are the API challenges we need to consider if we want to provide actual implementations
of these scoring systems: how to make them perform close to or as fast as lucene's current
scoring model.

Separately on the issue, I want to make Weight completely opaque to the sim, really its just
a way for a Similarity to compute things up front (such as IDF, but maybe things like these
bm25 length norm caches too). Currently it can only have a single float value (see my un-sqrt'ing
and other hacks in the Mock sims), so this should be fixed.

Additionally another big TODO: just as Scorer was split (maybe we should rename it to Matcher
now that sim does the calcs?), the process of Explanations need to be split too, where a Sim
is completely responsible for explaining itself.

Another TODO i have is to write the norm summation into the norms file as a single vlong,
rather than computing it across all byte[] in segmentreader like I do now... I just implemented
it this way so that we could play with scoring algorithms easily.

So, the good news would be that scoring is a lot more flexible, but the "bad" news is that
in order to support lucene's features, implementing a new ranking system on top of Similarity
is really *serious* work, as you need to:
# implement the lower-level API efficiently, yet expose a nice high-level API such as TFIDFSimilarity's
tf() and idf() hooks for users.
# implement explanations so that users can debug relevance issues.
# think about allowing users to balance the various performance tradeoffs, such as balancing
the performance gained by caching things versus using realtime statistics (some of this could
be in my head, maybe computing 256 norm decoder caches up-front is really cheap and a non-issue).
# consider how to integrate lucene's features into the ranking system, for example how to
estimate a reasonable "phrase IDF" for phrase/multiphrase/span queries, how to integrate index-time
boosts (in my example BM25 etc, I just made the documents appear shorter to accomplish this),
depending upon how the length normalization is being stored in the index, how to pick the
best quantization (might not be SmallFloat352), etc etc. 
# do all the relevance testing to ensure that things are correct (i found lots of bugs doing
rough testing on my Mock ones, there are probably more!, but on the couple test collections
i tried they seemed reasonable)
# adding good quality documentation such as what we have today in TFIDFSimilarity that explains
how the ranking system works and how you can tune it.


> [GSoC] Implementing State of the Art Ranking for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-2959
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2959
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Examples, Javadocs, Query/Scoring
>            Reporter: David Mark Nemeskey
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: implementation_plan.pdf, proposal.pdf
>
>
> Lucene employs the Vector Space Model (VSM) to rank documents, which compares
> unfavorably to state of the art algorithms, such as BM25. Moreover, the architecture
is
> tailored specically to VSM, which makes the addition of new ranking functions a non-
> trivial task.
> This project aims to bring state of the art ranking methods to Lucene and to implement
a
> query architecture with pluggable ranking functions.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message