lucene-java-user mailing list archives

From renou oki <>
Subject Re: Stop search process when a given number of hits is reached
Date Thu, 07 Aug 2008 17:00:08 GMT
Thanks a lot for your responses...

I tried a custom HitCollector that throws an exception when the hit limit is reached...
It works fine, and the search time is really reduced when a lot of docs match
the query...

Here is what I did:

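import org.apache.lucene.search.HitCollector;

public class CountCollector extends HitCollector {
    public int cpt;       // hits counted so far
    private int _maxHit;  // stop after this many hits
    public CountCollector(int maxHit) {
        cpt = 0;
        _maxHit = maxHit;
    }

    // Called once per matching document.
    public void collect(int doc, float score) {
        cpt++;
        if (cpt >= _maxHit)
            throw new LimitIsReachedException(); // must be an unchecked exception
    }
}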

With a simple try/catch around the search call, I catch the exception and display "cpt" (the counter)...
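
For reference, the catching side could look roughly like the sketch below. It does not use Lucene, so that it runs on its own: MatchCollector and runSearch are hypothetical stand-ins for HitCollector and the searcher's search(query, collector) call, and LimitIsReachedException is assumed to extend RuntimeException, since collect() declares no checked exceptions.

```java
// Stand-in for Lucene's HitCollector (assumption: same collect(doc, score) shape).
abstract class MatchCollector {
    abstract void collect(int doc, float score);
}

// Unchecked, so it can escape collect() without a throws clause.
class LimitIsReachedException extends RuntimeException {
}

class CountCollector extends MatchCollector {
    int cpt = 0;               // hits counted so far
    private final int maxHit;  // stop after this many hits

    CountCollector(int maxHit) {
        this.maxHit = maxHit;
    }

    @Override
    void collect(int doc, float score) {
        cpt++;
        if (cpt >= maxHit) {
            throw new LimitIsReachedException();
        }
    }
}

class EarlyStopDemo {
    // Hypothetical stand-in for searcher.search(query, collector):
    // pretends every document id in [0, matchingDocs) matches the query.
    static void runSearch(int matchingDocs, MatchCollector collector) {
        for (int doc = 0; doc < matchingDocs; doc++) {
            collector.collect(doc, 1.0f);
        }
    }

    // Returns the number of hits counted, capped at maxHit (for maxHit >= 1).
    static int countUpTo(int matchingDocs, int maxHit) {
        CountCollector collector = new CountCollector(maxHit);
        try {
            runSearch(matchingDocs, collector);
        } catch (LimitIsReachedException e) {
            // Expected: the limit was reached and the search was abandoned.
        }
        return collector.cpt;
    }

    public static void main(String[] args) {
        System.out.println(countUpTo(1000000, 500)); // many matches: stops early
        System.out.println(countUpTo(42, 500));      // few matches: runs to the end
    }
}
```

When fewer documents match than the limit, no exception is thrown and the counter simply holds the real number of hits.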

Best regards

----- Original Message ----
From: Andrzej Bialecki <>
To:
Sent: Thursday, 7 August 2008, 14:29:31
Subject: Re: Stop search process when a given number of hits is reached

Doron Cohen wrote:
> Nothing built in that I'm aware of will do this, but it can be done by
> searching with your own HitCollector.
> There is a related feature - stop search after a specified time - using
> TimeLimitedCollector.
> It is not released yet, see issue LUCENE-997.
> In short, the collector's collect() method is invoked in the search process
> for each matching document.
> Once 500 docs were collected, your collector can cause the search to stop by
> throwing an exception.
> Upon catching the exception you know that 500 docs were collected.

Two additional comments:

* the top-N results from such an incomplete search may be way off, if there
were some high-scoring documents somewhere beyond the limit.

* if you know that there are more important and less important documents 
in your corpus, and their relative weight is independent of the query 
(e.g. PageRank-type score), then you can restructure your index so that 
postings belonging to highly-scoring documents come first on the posting 
lists - this way you have a better chance to collect highly relevant 
documents first, even though the search is incomplete. You can find an 
implementation of this concept in Nutch 
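
The effect of both comments can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not how Lucene or Nutch implement it: Posting, SortedPostingsDemo and their methods are hypothetical names, and a document's score is taken to be its query-independent weight alone.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// One entry of a posting list, carrying a query-independent weight
// (for example a PageRank-like value). Hypothetical type for illustration.
class Posting {
    final int doc;
    final float staticScore;

    Posting(int doc, float staticScore) {
        this.doc = doc;
        this.staticScore = staticScore;
    }
}

class SortedPostingsDemo {
    // Visits at most `limit` postings in list order and returns the best score seen.
    static float bestScoreWithLimit(List<Posting> postings, int limit) {
        float best = Float.NEGATIVE_INFINITY;
        int collected = 0;
        for (Posting p : postings) {
            if (collected >= limit) {
                break; // search stopped early
            }
            best = Math.max(best, p.staticScore);
            collected++;
        }
        return best;
    }

    // Ten postings in document-id order; only the last document is important.
    static List<Posting> samplePostings() {
        List<Posting> postings = new ArrayList<>();
        for (int doc = 0; doc < 10; doc++) {
            postings.add(new Posting(doc, doc == 9 ? 10.0f : 1.0f));
        }
        return postings;
    }

    // Copy of the list reordered so that highly weighted documents come first.
    static List<Posting> sortedByStaticScore(List<Posting> postings) {
        List<Posting> sorted = new ArrayList<>(postings);
        sorted.sort(Comparator.comparingDouble((Posting p) -> p.staticScore).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Posting> byDocId = samplePostings();
        // Doc-id order: the important document lies beyond the limit and is missed.
        System.out.println(bestScoreWithLimit(byDocId, 3));
        // Static-score order: the important document is collected first.
        System.out.println(bestScoreWithLimit(sortedByStaticScore(byDocId), 3));
    }
}
```

With the limit set to 3, the doc-id ordering never reaches the important document, while the reordered list returns it immediately.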

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
