lucene-dev mailing list archives

From "Stefan Pohl (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-4100) Maxscore - Efficient Scoring
Date Sat, 02 Jun 2012 15:48:22 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Pohl updated LUCENE-4100:
--------------------------------

    Attachment: maxscore.patch
                contrib_maxscore.tgz

Attached are a tarball that includes the maxscore code (to be unpacked in /lucene/contrib/) and
a patch that integrates it into core Lucene (for now, both are based on trunk r1300967).

From the README, included in the tarball:
This contrib package implements the 'maxscore' optimization, originally presented in the
IR domain in 1995 by H. Turtle & J. Flood.

If you'd like to play with this implementation, for instance, to estimate
its usefulness for your kind of queries and index data, follow these steps:
1) Build a normal Lucene40 index with your data
2) Rewrite this index using the main method of the class
   org.apache.lucene.index.IndexRewriter
   with source and destination directories as arguments. This class will iterate over your
   index segments, parse them, compute a maxscore for each term using collection statistics
   of the source index, and write them to the destination directory using the
   Lucene40Maxscore codec. The resulting index should be slightly bigger. Currently, Lucene's
   DefaultSimilarity is used to estimate maxscores, so the same Similarity has to be used at
   query time for maxscore to be effective.
3) Apply the patch to a checkout of Lucene4 trunk revision 1300967 and place the maxscore
code directory below /lucene/contrib/.
4) Once the patch is applied, org.apache.lucene.search.BooleanQuery should
   contain the logic required to use the MaxscoreScorer on the index from 2)
   when the index is searched as usual:

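   int topk = 10;
   // Use the Similarity that was used to estimate the maxscores in 2)
   searcher.setSimilarity(new DefaultSimilarity());
   Query q = queryparser.parse("t1 t2 t3 t4");
   // Collect only the top-k hits; totalHits is not available (see Note)
   MaxscoreDocCollector ms_coll = new MaxscoreDocCollector(topk);
   searcher.search(q, ms_coll);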
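
To make step 2) more concrete, here is a minimal, self-contained sketch of what "compute a maxscore for each term" could look like. It is not the IndexRewriter from the tarball and does not use any Lucene API. The class MaxscoreEstimateSketch, its methods, the in-memory document representation, and the simplified TF-IDF formula (loosely modelled on DefaultSimilarity, ignoring boosts, coord, query normalization and norm quantization) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical stand-in for the idea behind step 2), not the tarball's code. */
final class MaxscoreEstimateSketch {

    /** Assumed, simplified TF-IDF contribution of one term to one document. */
    static double termScore(int freq, int docLength, int docFreq, int numDocs) {
        double tf = Math.sqrt(freq);
        double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));
        double lengthNorm = 1.0 / Math.sqrt(docLength);
        return tf * idf * idf * lengthNorm;
    }

    /**
     * Returns, for every term, the largest score contribution it makes to any document.
     * Each document is modelled as a map from term to within-document frequency.
     */
    static Map<String, Double> computeMaxscores(List<Map<String, Integer>> docs) {
        int numDocs = docs.size();

        // First pass: collection statistics (document frequency per term).
        Map<String, Integer> docFreq = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (String term : doc.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        // Second pass: score every posting and keep the maximum per term.
        Map<String, Double> maxscores = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            int docLength = 0;
            for (int freq : doc.values()) {
                docLength += freq;
            }
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                double s = termScore(e.getValue(), docLength, docFreq.get(e.getKey()), numDocs);
                maxscores.merge(e.getKey(), s, Math::max);
            }
        }
        return maxscores;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> docs = List.of(
            Map.of("a", 1, "b", 1),
            Map.of("a", 4),
            Map.of("b", 1, "c", 1));
        System.out.println(computeMaxscores(docs));
    }
}
```

Because the bound is derived from a specific scoring formula, it only holds if the same formula is used at query time, which is why the README ties the rewritten index to DefaultSimilarity.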

Note:
- Your index at 1) does not have to be 'optimized' (it does not have to consist
  of one index segment only). In fact, maxscore can be more efficient with
  multiple segments because multiple maxscores are computed for many frequent
  terms for subsets of documents, resulting in tighter bounds and more effective
  pruning.
- Don't expect totalHits to return the same counts as before.
  MaxscoreDocCollector's sole purpose is to notify you about this by throwing
  an exception when you try to use the getter.
- Currently, only purely disjunctive, flat queries are supported
- Only DefaultSimilarity has been tested
- @experimental !
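
The notes on pruning and on totalHits can be illustrated with a small, self-contained sketch of maxscore-style top-k evaluation for a flat disjunctive query. This is not the MaxscoreScorer from the patch: the class MaxscorePruningSketch, the in-memory postings (term -> docId -> score contribution) and the Result type are hypothetical, and scores are assumed to be non-negative. Terms whose summed maxscores cannot beat the current k-th best score stop producing candidate documents, so fewer documents are scored than actually match the query.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Illustrative maxscore-style top-k search over in-memory postings (hypothetical, not the patch). */
final class MaxscorePruningSketch {

    /** Top-k doc IDs (best first) and the number of documents that were actually scored. */
    static final class Result {
        final List<Integer> topDocs = new ArrayList<>();
        int docsScored;
    }

    /** Assumes every query term has a non-empty postings map with non-negative scores. */
    static Result search(Map<String, TreeMap<Integer, Double>> postings, List<String> query, int k) {
        Result result = new Result();
        if (k <= 0) return result;

        // Order the query terms' postings by ascending maxscore.
        List<TreeMap<Integer, Double>> lists = new ArrayList<>();
        for (String term : query) lists.add(postings.get(term));
        lists.sort(Comparator.comparingDouble(
            (TreeMap<Integer, Double> l) -> Collections.max(l.values())));
        int n = lists.size();

        // prefix[i] = best score reachable using only the i lowest-maxscore lists.
        double[] prefix = new double[n + 1];
        for (int i = 0; i < n; i++) {
            prefix[i + 1] = prefix[i] + Collections.max(lists.get(i).values());
        }

        // Min-heap of {score, docId}; its head is the current k-th best score.
        PriorityQueue<double[]> heap =
            new PriorityQueue<>(Comparator.comparingDouble((double[] h) -> h[0]));
        double threshold = Double.NEGATIVE_INFINITY;
        int firstEssential = 0; // lists below this index no longer produce candidates
        int doc = -1;

        while (firstEssential < n) {
            // The next candidate is the smallest unseen doc ID in any essential list.
            Integer next = null;
            for (int i = firstEssential; i < n; i++) {
                Integer cand = lists.get(i).higherKey(doc);
                if (cand != null && (next == null || cand < next)) next = cand;
            }
            if (next == null) break;
            doc = next;
            result.docsScored++;

            double score = 0;
            for (int i = firstEssential; i < n; i++) {
                score += lists.get(i).getOrDefault(doc, 0.0);
            }
            // Consult non-essential lists only while the doc can still beat the threshold.
            for (int i = firstEssential - 1; i >= 0 && score + prefix[i + 1] > threshold; i--) {
                score += lists.get(i).getOrDefault(doc, 0.0);
            }

            if (heap.size() < k) {
                heap.add(new double[] {score, doc});
            } else if (score > threshold) {
                heap.poll();
                heap.add(new double[] {score, doc});
            }
            if (heap.size() == k) threshold = heap.peek()[0];
            // Lists whose combined maxscore cannot exceed the threshold become non-essential.
            while (firstEssential < n && prefix[firstEssential + 1] <= threshold) firstEssential++;
        }

        while (!heap.isEmpty()) result.topDocs.add(0, (int) heap.poll()[1]);
        return result;
    }

    public static void main(String[] args) {
        Map<String, TreeMap<Integer, Double>> postings = new HashMap<>();
        TreeMap<Integer, Double> common = new TreeMap<>();
        for (int i = 0; i < 10; i++) common.put(i, 0.125); // frequent, low-scoring term
        TreeMap<Integer, Double> rare = new TreeMap<>();
        rare.put(1, 2.0);
        rare.put(7, 1.5);
        postings.put("common", common);
        postings.put("rare", rare);
        Result r = search(postings, List.of("common", "rare"), 2);
        System.out.println(r.topDocs + " after scoring " + r.docsScored + " of 10 matching docs");
    }
}
```

In this toy data all ten documents match the query, yet only a few are scored once the frequent term becomes non-essential. A hit count derived from scored documents would therefore undercount, which is consistent with the note about totalHits. Tighter per-term bounds (for example, bounds computed per segment) would let lists become non-essential earlier.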

                
> Maxscore - Efficient Scoring
> ----------------------------
>
>                 Key: LUCENE-4100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4100
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs, core/query/scoring, core/search
>    Affects Versions: 4.0
>            Reporter: Stefan Pohl
>              Labels: api-change, patch, performance
>             Fix For: 4.0
>
>         Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient algorithm first
> published in the IR domain in 1995 by H. Turtle & J. Flood, which I find deserves more
> attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with example queries
> and lucenebench, the package of Mike McCandless, resulting in very significant speedups.
> This ticket is meant to start the discussion on including the implementation in Lucene's
> codebase. Because the technique requires awareness of it from the Lucene user/developer,
> it seems best for it to become a contrib/module package so that it can be chosen consciously.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        


