lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: best way to iterate through all docs from a query
Date Sat, 21 Nov 2009 10:02:06 GMT
If you only want to gather the docIDs, this approach is rather
wasteful, as it 1) computes the score of each hit, and 2) keeps a
sorted queue of all these hits.  It also pre-allocates a 1M sized
array, which is wasteful if your index doesn't have 1M docs.

It's better to create your own subclass of Collector that, e.g., appends
each collected docID to a List or array, or sets them in a bit set (if
you expect that more than 1/8th of your index may be returned by the
query).
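
A minimal sketch of such a collector, not taken from the original
message. To stay compilable without the Lucene jar it uses a
hypothetical stand-in base class, SegmentHitCollector, whose two
callbacks mirror setNextReader(IndexReader, int docBase) and
collect(int doc) on Lucene 2.9's Collector. The real class also
requires setScorer and acceptsDocsOutOfOrder, so check the javadocs
for your version. All class and method names below are assumptions
made for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in for the callback shape of Lucene 2.9's Collector.
// NOT the Lucene class: it exists only so this sketch compiles on its own.
abstract class SegmentHitCollector {
    // Called when the search moves to a new index segment; docBase is the
    // global docID of that segment's first document.
    abstract void setNextSegment(int docBase);

    // Called once per matching document with a segment-relative docID.
    abstract void collect(int segmentDocId);
}

// Appends each hit's global docID to a list. No score is computed and
// no sorted queue is maintained.
class DocIdListCollector extends SegmentHitCollector {
    private final List<Integer> docIds = new ArrayList<Integer>();
    private int docBase;

    @Override
    void setNextSegment(int docBase) { this.docBase = docBase; }

    @Override
    void collect(int segmentDocId) { docIds.add(docBase + segmentDocId); }

    List<Integer> docIds() { return docIds; }
}

// Sets one bit per hit; memory is proportional to the index size (maxDoc)
// rather than to the number of hits.
class DocIdBitSetCollector extends SegmentHitCollector {
    private final BitSet bits;
    private int docBase;

    DocIdBitSetCollector(int maxDoc) { this.bits = new BitSet(maxDoc); }

    @Override
    void setNextSegment(int docBase) { this.docBase = docBase; }

    @Override
    void collect(int segmentDocId) { bits.set(docBase + segmentDocId); }

    BitSet bits() { return bits; }
}

class DocIdCollectorDemo {
    public static void main(String[] args) {
        // Simulate hits in two segments: one starting at doc 0, one at doc 10.
        DocIdListCollector list = new DocIdListCollector();
        list.setNextSegment(0);
        list.collect(2);
        list.collect(5);
        list.setNextSegment(10);
        list.collect(3);
        System.out.println(list.docIds());
    }
}
```

With the real API, an instance of the Collector subclass would be
passed to searcher.search(query, collector) in place of the
TopDocCollector, and the docIDs read back after the search returns.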

Mike

On Fri, Nov 20, 2009 at 10:17 AM, it99 <deswiatlowski@syrres.com> wrote:
>
> Thanks, that helped a lot with the speed!
> I am getting the same search results but with different docIds. Is this expected
> and OK? Are they just arbitrary numbers?
>
> I changed from
>            Hits hits = mSearcher.search(query, filter);
>
> to the following:
>            TopDocCollector collector = new TopDocCollector(1000000);
>            mSearcher.search(query, filter, collector);
>            hits = collector.topDocs().scoreDocs;
>
> Ian Lea wrote:
>>
>> First queries are often slow and subsequent ones faster.  Search the
>> list for warming - I think there was something on it in the last
>> couple of days.  Or read the "When measuring performance, disregard
>> the first query" bit of
>> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
>>
>> A good number to pass to the Collector is however many docs you are
>> going to be interested in.  If you are just going to display the first
>> 10, pass 10.
>>
>>
>> --
>> Ian.
>>
>>
>> On Thu, Nov 19, 2009 at 3:36 PM, it99 <deswiatlowski@syrres.com> wrote:
>>>
>>> What is the best way to iterate over all the documents in a set of
>>> search results?
>>> Previously I was using the deprecated Hits object but changed the
>>> implementation, as recommended in the javadocs, to ScoreDoc.
>>>
>>> I've tried the following, but I've seen warnings about performance.
>>> The first time I query something it takes a long time, and after
>>> that it is quick.
>>>
>>>
>>>
>>>                for (int i = 0; i < mNumberOfHits; i++)
>>>                {
>>>
>>>                    int docId = hits[i].doc;
>>>                    Document doc = searcher.doc(docId);
>>>                }
>>>
>>> Here's the code for the search.
>>> What is a good number to pass into TopDocCollector?
>>>
>>>            TopDocCollector collector = new TopDocCollector(1000000);
>>>            searcher.search(query, collector);
>>>            ScoreDoc[] hits = collector.topDocs().scoreDocs;
>>> --
>>> View this message in context:
>>> http://old.nabble.com/best-way-to-iterate-through-all-docs-from-a-query-tp26421373p26421373.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>
> --
> View this message in context: http://old.nabble.com/best-way-to-iterate-through-all-docs-from-a-query-tp26421373p26442733.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
