lucene-java-user mailing list archives

From Simon Willnauer <simon.willna...@googlemail.com>
Subject Re: Using HitCollector to Collect First N Hits
Date Sat, 22 Aug 2009 16:19:03 GMT
Hi Len,

On Sat, Aug 22, 2009 at 5:51 PM, Len Takeuchi<ltakeuchi@jostleme.com> wrote:
> Hello,
>
> I have attached the original thread from where I got my information at the
> very bottom in case it is of any help. In regards to whether I want just a
> boolean retrieval model, in the usage we are currently discussing, the
> answer is yes (I don't care about the score).  However, we also do other
> queries in other use cases where we do care about the score.
>
> In regards to the type of query, we are using a prefix query.  We have a
> problem with performance when the prefix entered by the user is short
> because it yields a large number of hits.  I was hoping that, by taking the
> approach I mentioned, the search engine would call the HitCollector
> incrementally and give me a chance to end the search early, but that does
> not seem to be happening.  Do you think the problem is with term expansion?
That was my first guess, and I'm pretty sure that most of the time is
spent before any documents get scored. A short prefix can easily
expand to thousands of terms. Do you encounter
BooleanQuery.TooManyClauses exceptions, and in turn do you set
BooleanQuery#setMaxClauseCount() to a value higher than 1024?
I wonder if BooleanQuery#setAllowDocsOutOfOrder(true) would give you
any performance gain if you don't care about the order in which the
docs come in. Any idea how many terms your prefix query expands to?
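
To make the term-expansion cost concrete, here is a minimal sketch of how a prefix maps onto a sorted term dictionary. It deliberately uses a plain java.util.TreeSet as a stand-in for the index's term dictionary rather than any Lucene API, and the class and method names (PrefixExpansion, countExpandedTerms) are made up for the illustration. The point is only that every term sharing the prefix becomes part of the expanded query, so a shorter prefix means more terms to process before the first hit is reported.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixExpansion {

    // Counts the terms in a sorted dictionary that start with the given prefix.
    // Hypothetical helper: it models term expansion, it is not Lucene code.
    static int countExpandedTerms(SortedSet<String> terms, String prefix) {
        int count = 0;
        // tailSet() starts at the first term >= prefix; all terms sharing the
        // prefix are contiguous from there in sorted order.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // left the block of matching terms
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(
                List.of("lucene", "lucid", "luck", "lumen", "sea", "search"));
        // A shorter prefix expands to at least as many terms as a longer one.
        System.out.println("lu  -> " + countExpandedTerms(terms, "lu"));
        System.out.println("luc -> " + countExpandedTerms(terms, "luc"));
    }
}
```

Running the same kind of count against the real index for the prefixes users typically enter would show whether the expansion size lines up with the slow queries.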

simon
>
> Regards,
> Len
>
> ----- original message -----
> From: simon.willnauer [at] googlemail
> Re: Using HitCollector to Collect First N Hits
>
> Hi Len,
> What kind of query do you execute when you collect the hits?
> HitCollector should be called for each document at the time it is
> scored. Is it possible that you run a query that could be expensive in
> terms of term expansion, like WildcardQuery?
>
> simon
>
> ----- original message -----
> From : ltakeuchi [at] jostleme
> Stop search process when a given number of hits is reached
> Hello,
>
> I’m using Lucene 2.4.1 and I’m trying to use a custom HitCollector to
> collect only the first N hits (not the best hits) for performance. I saw
> another e-mail in this group that mentioned writing a HitCollector which
> throws an exception after N hits to do this. I tried this approach, and it
> seems as if my HitCollector isn’t called until the hits have been
> determined, i.e. the time until my HitCollector is first called depends on
> the number of hits, and my performance is no better than when I was not
> using a custom HitCollector. Does anyone have insight into my problem? The
> person who tried this approach mentioned that performance improved
> significantly for him.
>
> Regards,
> Len
>
>
>
>
>
>
>
> ORIGINAL THREAD BELOW
> =====================
>
> ----- original message -----
> From : yodapoubelle [at] yahoo
> Re : Stop search process when a given number of hits is reached
>
> Thanks a lot for your responses...
>
> I have tried the HitCollector, throwing an exception when the limit of hits
> is reached...
> It works fine and the search time is really reduced when there are a lot of
> docs matching the query...
>
> Here is what I did:
>
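> import org.apache.lucene.search.HitCollector;
>
> // Counts hits and aborts the search once more than maxHit docs have matched.
> public class CountCollector extends HitCollector {
>     public int cpt;            // number of hits seen so far
>     private final int _maxHit; // hit limit
>
>     public CountCollector(int maxHit) {
>         cpt = 0;
>         _maxHit = maxHit;
>     }
>
>     // Called by the searcher once per matching document.
>     public void collect(int doc, float score) {
>         cpt++;
>         if (cpt > _maxHit) {
>             // LimitIsReachedException is not shown here; it must be unchecked
>             // (e.g. extend RuntimeException) because collect() declares none.
>             throw new LimitIsReachedException();
>         }
>     }
> }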
>
> With a simple try catch, I catch the exception, and display "cpt" (the
> counter)...
>
> Best regards
>
> ----- Original message ----
> From: Andrzej Bialecki <ab[at]getopt.org>
> To: java-user[at]lucene.apache.org
> Sent: Thursday, 7 August 2008, 14:29:31
> Subject: Re: Stop search process when a given number of hits is reached
>
> Doron Cohen wrote:
>> Nothing built in that I'm aware of will do this, but it can be done by
>> searching with your own HitCollector.
>> There is a related feature - stop search after a specified time - using
>> TimeLimitedCollector.
>> It is not released yet, see issue LUCENE-997.
>> In short, the collector's collect() method is invoked in the search
>> process for each matching document.
>> Once 500 docs were collected, your collector can cause the search to stop
>> by throwing an exception.
>> Upon catching the exception you know that 500 docs were collected.
>
> Two additional comments:
>
> * the topN results from such incomplete search may be way off, if there
> were some high scoring documents somewhere beyond the limit.
>
> * if you know that there are more important and less important documents
> in your corpus, and their relative weight is independent of the query
> (e.g. PageRank-type score), then you can restructure your index so that
> postings belonging to highly-scoring documents come first on the posting
> lists - this way you have a better chance to collect highly relevant
> documents first, even though the search is incomplete. You can find an
> implementation of this concept in Nutch
> (org.apache.nutch.indexer.IndexSorter).
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>
> ----- original message -----
> From : yodapoubelle [at] yahoo
> Stop search process when a given number of hits is reached
>
> Hello
>
> Is there a way to stop the search process when a given number of hits is
> reached?
>
> I have a counter feature which displays how many docs match a query.
> This counter is capped: if there are more than 500 docs, it
> will just display "more than 500".
> I don't care about the exact number of docs matched by the query, the order
> of the hits or whatever...
> What I want is to stop the search process once it has reached at least 500
> hits in order to improve performance...
> (I want an average search time of about 50-100 ms)
>
> I experimented with the following methods :
> For the same query:
> with search(Query query, Filter filter, Sort sort) hits=157691 docs in
> searchingTime=2514 ms
> with search(Query query, Filter filter, int n) (with n = 50) TopDocs
> totalHits 157691 in searchingTime= 2360 ms
>
> For another query:
> With search(Query query, Filter filter, Sort sort) hits=1208 docs in
> searchingTime=750 ms
> With search(Query query, Filter filter, int n) (with n = 50) TopDocs
> totalHits 1208 in searchingTime= 718 ms
>
> For another query:
> With search(Query query, Filter filter, Sort sort) hits=16174 cv(s)
> searchingTime=1297 ms
> With search(Query query, Filter filter, int n) (with n = 50) TopDocs
> totalHits 16174 in searchingTime= 1219 ms
>
> According to these results, replacing the first method with the second has
> no effect on either the search time or the total number of hits returned.
>
> Also, the Lucene version used now is 1.9.1 (but I am working on the upgrade
> to 2.3.2)
>
>
> Thanks a lot
> (Sorry for my bad English ... you will easily guess, I’m French ;)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
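
The exception-based early termination described in the quoted thread can be sketched without Lucene as follows. Everything here is illustrative: HitSink is a hypothetical stand-in for a per-hit callback such as HitCollector.collect(), the for loop stands in for the searcher, and LimitReachedException is an invented unchecked exception. As in the quoted CountCollector, the sink gives up once it has seen more than maxHits hits, so a result of maxHits + 1 means "more than maxHits".

```java
public class EarlyTermination {

    /** Hypothetical per-hit callback, standing in for a collector's collect(). */
    interface HitSink {
        void collect(int doc);
    }

    /** Unchecked, so it can escape a callback that declares no checked exceptions. */
    static class LimitReachedException extends RuntimeException {
    }

    /** Counts hits and aborts once more than maxHits have been reported. */
    static class CountingSink implements HitSink {
        final int maxHits;
        int count;

        CountingSink(int maxHits) {
            this.maxHits = maxHits;
        }

        @Override
        public void collect(int doc) {
            count++;
            if (count > maxHits) {
                throw new LimitReachedException();
            }
        }
    }

    /** Toy "search": reports each matching doc id to the sink until it aborts. */
    static int countUpTo(int[] matchingDocs, int maxHits) {
        CountingSink sink = new CountingSink(maxHits);
        try {
            for (int doc : matchingDocs) {
                sink.collect(doc);
            }
        } catch (LimitReachedException e) {
            // Expected: the sink asked to stop; the count is a lower bound.
        }
        return sink.count;
    }

    public static void main(String[] args) {
        int count = countUpTo(new int[1000], 500);
        System.out.println(count > 500 ? "more than 500" : String.valueOf(count));
    }
}
```

Note that this only cuts short the loop that reports hits. Any work done before the first call to collect(), such as expanding a short prefix into many terms, is not avoided by throwing from the callback.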

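Andrzej's second comment in the quoted thread, about putting highly scoring documents first, can be illustrated with a small sketch. It assumes each document has a query-independent static score (the array below is invented data) and simply orders document ids by that score, so that a search cut off after N hits sees the most important documents first. This is not how Nutch's IndexSorter is implemented; it only shows the ordering idea, and StaticRankOrder and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.Comparator;

public class StaticRankOrder {

    /** Returns doc ids ordered by descending static score; ties keep id order. */
    static int[] orderByStaticScore(double[] staticScore) {
        Integer[] ids = new Integer[staticScore.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Arrays.sort on objects is stable, so equal scores stay in id order.
        Arrays.sort(ids,
                Comparator.comparingDouble((Integer id) -> staticScore[id]).reversed());
        return Arrays.stream(ids).mapToInt(Integer::intValue).toArray();
    }

    /** The first n ids of the reordered list: what an early-terminated scan would see. */
    static int[] firstN(double[] staticScore, int n) {
        int[] ordered = orderByStaticScore(staticScore);
        return Arrays.copyOf(ordered, Math.min(n, ordered.length));
    }

    public static void main(String[] args) {
        double[] staticScore = {0.1, 0.9, 0.5, 0.7}; // made-up per-document weights
        System.out.println(Arrays.toString(firstN(staticScore, 2)));
    }
}
```

With the documents visited in this order, stopping after the first N hits trades completeness for speed while still favouring the documents with the highest static weight.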