lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring
Date Thu, 12 Jul 2012 18:35:34 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413055#comment-13413055 ]

Robert Muir commented on LUCENE-4100:
-------------------------------------

{quote}
Your index at 1) does not have to be 'optimized' (it does not have to consist of one index
segment only). In fact, maxscore can be more efficient with multiple segments because multiple
maxscores are computed for many frequent terms for subsets of documents, resulting in tighter
bounds and more effective pruning.
{quote}
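
To make the quoted claim concrete, here is a minimal sketch of why per-segment bounds can prune more than one global bound. It is simplified to a single term, and the scores, the threshold, and the helper names (max_score, prunable) are all made up for illustration; none of this is taken from the patch.

```python
def max_score(scores):
    """Upper bound on one term's score contribution over a set of documents."""
    return max(scores)


def prunable(bound, threshold):
    """The term's postings can be skipped when its bound cannot beat the threshold."""
    return bound <= threshold


# Hypothetical per-document scores of one frequent term, grouped by segment.
segments = [[0.20, 0.30, 0.25], [0.90, 0.10], [0.40, 0.35]]

# One bound for the whole index versus one bound per segment.
global_bound = max_score([s for seg in segments for s in seg])
segment_bounds = [max_score(seg) for seg in segments]

threshold = 0.5  # hypothetical score of the current k-th best hit

skipped_with_global_bound = [prunable(global_bound, threshold) for _ in segments]
skipped_with_segment_bounds = [prunable(b, threshold) for b in segment_bounds]
```

With the single global bound nothing can be skipped, while the per-segment bounds allow the first and third segment to be skipped for this term.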

I've been thinking about this a lot lately: while what you say is true, that's because you
reprocess all segments with IndexRewriter (which is fine for a static collection).

But this algorithm in general is not rank-safe with incremental indexing: the problem is that
when doing actual scoring, scores consist of per-segment, within-document stats (term frequency,
document length), but are also affected by collection-wide statistics from many other segments
(IDF, average document length, ...) or even from other machines in a distributed collection.

So I think for this to work and remain rank-safe, we cannot write the entire score into the
segment, because the score at actual search time depends on all the other segments being
searched. Instead I think this can only work when we can easily factor out an impact (e.g. in
the case of DefaultSimilarity the indexed maxscore excludes the IDF component, which is instead
multiplied in at search time).
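
A rough sketch of that factoring, assuming a simplified TF-IDF score of the form sqrt(freq) / sqrt(docLength) * idf^2 (boosts and query normalization left out). The function names and the postings are hypothetical; this is not Lucene code, only the arithmetic.

```python
import math


def impact(freq, doc_len):
    """Segment-local part of the score: depends only on the document itself."""
    return math.sqrt(freq) / math.sqrt(doc_len)


def idf(doc_freq, num_docs):
    """Collection-wide part: changes whenever other segments are added or merged."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(freq, doc_len, doc_freq, num_docs):
    return impact(freq, doc_len) * idf(doc_freq, num_docs) ** 2


# Hypothetical postings of one term in one segment: (freq, doc_len).
postings = [(3, 100), (1, 10), (8, 400)]

# Written once at index time; contains no collection-wide statistics.
stored_max_impact = max(impact(f, dl) for f, dl in postings)


def upper_bound(max_impact, doc_freq, num_docs):
    """Search-time bound: multiply the current idf weight back in."""
    return max_impact * idf(doc_freq, num_docs) ** 2
```

Because the idf weight is one positive factor shared by every document, the document with the largest impact is also the one with the largest score, whatever the collection looks like at search time. The stored value therefore stays a valid bound as segments come and go.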

I don't see how it can be rank-safe with algorithms like BM25 and incremental indexing, where
parameters like average document length are not simple multiplicative factors in the formula
but instead determine exactly how much tf versus document length contributes to the score.
I'll think about it some more, though.
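
To illustrate the BM25 concern, here is a small sketch using the usual BM25 term-frequency component with k1 = 1.2 and b = 0.75. The two documents and the average lengths are invented for the example, and bm25_tf is a hypothetical helper, not a Lucene method.

```python
def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term-frequency component; avg_doc_len sits inside the denominator."""
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return freq * (k1 + 1.0) / (freq + k1 * length_norm)


# Hypothetical postings of one term in one segment: (freq, doc_len).
short_doc = (1, 10)
long_doc = (5, 200)

# Average document length when the segment was written, and later after
# many longer documents were indexed into other segments.
avgdl_at_index_time = 10.0
avgdl_at_search_time = 200.0

# A maxscore computed and stored at index time from the full score.
stored_max = max(bm25_tf(f, dl, avgdl_at_index_time) for f, dl in (short_doc, long_doc))

# The best score the same segment can actually produce at search time.
actual_max = max(bm25_tf(f, dl, avgdl_at_search_time) for f, dl in (short_doc, long_doc))

stale_bound_is_violated = actual_max > stored_max
```

In this example the short document scores highest under the index-time average and the long document scores highest under the search-time average, and the stored maximum ends up below the real one. A bound that can be exceeded means pruning may drop documents that belong in the top k, which is exactly the loss of rank safety described above.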

                
> Maxscore - Efficient Scoring
> ----------------------------
>
>                 Key: LUCENE-4100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4100
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs, core/query/scoring, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Stefan Pohl
>              Labels: api-change, patch, performance
>             Fix For: 4.0
>
>         Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient algorithm first
> published in the IR domain in 1995 by H. Turtle & J. Flood, that I find deserves more
> attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with example queries
> and lucenebench, the package of Mike McCandless, resulting in very significant speedups.
> This ticket is to get started the discussion on including the implementation into Lucene's
> codebase. Because the technique requires awareness about it from the Lucene user/developer,
> it seems best to become a contrib/module package so that it consciously can be chosen to be
> used.
