lucene-java-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Stop search process when a given number of hits is reached
Date Thu, 07 Aug 2008 12:29:31 GMT
Doron Cohen wrote:
> Nothing built in that I'm aware of will do this, but it can be done by
> searching with your own HitCollector.
> There is a related feature - stop search after a specified time - using
> TimeLimitedCollector.
> It is not released yet, see issue LUCENE-997.
> In short, the collector's collect() method is invoked in the search process
> for each matching document.
> Once 500 docs were collected, your collector can cause the search to stop by
> throwing an exception.
> Upon catching the exception you know that 500 docs were collected.
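
To make the quoted suggestion concrete, here is a minimal sketch of a
collector that aborts the search by throwing. It is deliberately
self-contained: the Collector class below is a hypothetical stand-in for
Lucene's HitCollector (same collect(int doc, float score) shape), and
search() is a toy loop rather than a real IndexSearcher. All names in it
(EarlyStopDemo, LimitedCollector, LimitReachedException, firstHits) are
assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class EarlyStopDemo {

    /** Stand-in for Lucene's HitCollector: called once per matching document. */
    abstract static class Collector {
        abstract void collect(int doc, float score);
    }

    /** Unchecked, so it can escape collect(), which declares no checked exceptions. */
    static class LimitReachedException extends RuntimeException {
        LimitReachedException(int limit) {
            super("collected " + limit + " hits");
        }
    }

    /** Remembers the first `limit` hits, then aborts the search by throwing. */
    static class LimitedCollector extends Collector {
        final int limit;
        final List<Integer> docs = new ArrayList<>();

        LimitedCollector(int limit) {
            if (limit < 1) throw new IllegalArgumentException("limit must be >= 1");
            this.limit = limit;
        }

        @Override
        void collect(int doc, float score) {
            docs.add(doc);
            if (docs.size() >= limit) {
                throw new LimitReachedException(limit);
            }
        }
    }

    /** Toy "searcher": feeds every matching doc to the collector in doc-ID order. */
    static void search(int[] matchingDocs, float[] scores, Collector c) {
        for (int i = 0; i < matchingDocs.length; i++) {
            c.collect(matchingDocs[i], scores[i]);
        }
    }

    /** Runs the search; sets stopped[0] to true if the limit cut it short. */
    public static List<Integer> firstHits(int[] docs, float[] scores,
                                          int limit, boolean[] stopped) {
        LimitedCollector c = new LimitedCollector(limit);
        try {
            search(docs, scores, c);
        } catch (LimitReachedException e) {
            stopped[0] = true; // the collector holds exactly `limit` docs here
        }
        return c.docs;
    }

    public static void main(String[] args) {
        int[] docs = {3, 7, 8, 15, 42};
        float[] scores = {0.2f, 1.5f, 0.4f, 2.0f, 0.9f};
        boolean[] stopped = {false};
        List<Integer> hits = firstHits(docs, scores, 3, stopped);
        System.out.println(hits + " stoppedEarly=" + stopped[0]);
        // prints: [3, 7, 8] stoppedEarly=true
    }
}
```

Note that in this toy run the two best-scoring docs (15 and 7) are not both
among the collected ones: hits arrive in doc-ID order, not score order.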

Two additional comments:

* the top-N results from such an incomplete search may be way off if there
are high-scoring documents somewhere beyond the cutoff.

* if you know that there are more important and less important documents
in your corpus, and their relative weight is independent of the query
(e.g. a PageRank-type score), then you can restructure your index so that
postings belonging to high-scoring documents come first in the posting
lists. This way you have a better chance of collecting highly relevant
documents first, even though the search is incomplete. You can find an
implementation of this concept in Nutch
(org.apache.nutch.indexer.IndexSorter).
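
A toy illustration of both points, under stated assumptions: documents are
visited in doc-ID order, each has a query-independent static rank, and
"sorting the index" is modelled simply as renumbering documents by
descending rank. This is not the Nutch IndexSorter code, and every name in
it (SortedIndexDemo, firstRanks, firstRanksAfterSort) is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortedIndexDemo {

    /** Static ranks of the first `limit` matching docs, visited in doc-ID order. */
    public static List<Double> firstRanks(double[] rankByDocId, boolean[] matches, int limit) {
        List<Double> out = new ArrayList<>();
        for (int doc = 0; doc < rankByDocId.length && out.size() < limit; doc++) {
            if (matches[doc]) out.add(rankByDocId[doc]);
        }
        return out;
    }

    /** Old doc IDs ordered by descending static rank; array position = new doc ID. */
    static Integer[] newOrder(double[] rankByDocId) {
        Integer[] order = new Integer[rankByDocId.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble((Integer d) -> -rankByDocId[d]));
        return order;
    }

    /** Renumbers documents by descending rank, then takes the first `limit` hits. */
    public static List<Double> firstRanksAfterSort(double[] rank, boolean[] matches, int limit) {
        Integer[] order = newOrder(rank);
        double[] sortedRank = new double[rank.length];
        boolean[] sortedMatches = new boolean[rank.length];
        for (int newId = 0; newId < order.length; newId++) {
            sortedRank[newId] = rank[order[newId]];
            sortedMatches[newId] = matches[order[newId]];
        }
        return firstRanks(sortedRank, sortedMatches, limit);
    }

    public static void main(String[] args) {
        double[] rank = {0.1, 0.9, 0.3, 0.8, 0.2, 0.7};
        boolean[] matches = {true, true, true, false, true, true};
        // Unsorted index: the first two hits are whatever has the lowest doc IDs.
        System.out.println(firstRanks(rank, matches, 2));          // prints [0.1, 0.9]
        // Rank-sorted index: the first two hits are the best-ranked matches.
        System.out.println(firstRanksAfterSort(rank, matches, 2)); // prints [0.9, 0.7]
    }
}
```

With the sorted numbering, stopping after N hits returns the N best matches
by static rank; it still says nothing about query-dependent score.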

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



