lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: How to improve performance of large numbers of successive searches?
Date Thu, 10 Apr 2008 20:18:00 GMT
From this <<<iterate over all of the hits>>> I infer that you're
using a Hits object. That's a no-no when fetching more than 100
or so documents. In a nutshell, the query gets re-executed roughly every 100
fetches, so iterating over your 2,000 hits re-runs the query about 20 times.

The Hits object is optimized for returning the top few scoring
documents rather than retrieving the entire result set.

See HitCollector/TopDocs/TopDocCollector etc. for better ways
of doing this.
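For example, here's a minimal sketch against the 2.3 API (the index path,
field name and term are placeholders, not anything from your setup) that
collects every matching doc id in a single pass, with no Hits object involved:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CollectAllHits {
    public static List collectDocIds(String indexDir, String field, String value)
            throws IOException {
        IndexSearcher searcher = new IndexSearcher(indexDir);
        Query query = new TermQuery(new Term(field, value));

        final List docIds = new ArrayList();
        searcher.search(query, new HitCollector() {
            // called once per matching document; the score is ignored here
            public void collect(int doc, float score) {
                docIds.add(new Integer(doc));
            }
        });
        searcher.close();
        return docIds;
    }
}

TopDocCollector works the same way if you do want the top N by score:
pass it to search() and call topDocs() afterwards.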

Also, if you're calling IndexReader.document(i) for each hit,
you'll inevitably take a lot of time because you're loading every stored field of each document.
Think about lazy field loading (see FieldSelector).
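Something like this sketch, assuming "reader" is your IndexReader, "docId" is
one of the collected ids, and "id" stands in for the one stored field you
actually need, avoids loading all the other stored fields:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.MapFieldSelector;

// load only the "id" field instead of the whole document
FieldSelector onlyId = new MapFieldSelector(new String[] { "id" });
Document doc = reader.document(docId, onlyId);
String id = doc.get("id");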

Best
Erick

P.S. If this is totally off base, perhaps you could post some of the
code you think is slow....

On Thu, Apr 10, 2008 at 2:34 PM, Chris McGee <cbmcgee@ca.ibm.com> wrote:

> Hello,
>
> I am building fairly large directories (200-500 MB of disk space) using
> lucene-java. Sometimes it can take upwards of 10-15 mins to create the
> documents and write them to disk using my current configuration. I have
> upgraded to the latest 2.3.1 version and followed many of the
> recommendations offered on the wiki:
>
> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>
> These tips have significantly improved the time to build the directory and
> search it. However, I have noticed that when I perform term queries using
> a searcher many times in rapid succession and iterate over all of the hits,
> it can take a significant amount of time. Performing 1000 term query
> searches, each with around 2000 hits, takes well over a minute. The time
> seems to scale linearly with the number of searches (i.e. 10 times more
> searches take 10 times longer). I tried combining the searches into a
> BooleanQuery, but it only shaves off a small percentage (5-10%) of the total time.
>
> I was wondering if there is a faster way to retrieve all of the results
> for my large collections of terms without using more memory and without
> taking more time to build the directory. I already looked at bypassing the
> searcher and using the IndexReader.termDocs() method directly to retrieve
> the documents, but there did not seem to be much performance improvement.
> In the majority of my cases I am simply looking for a large number of
> values of the same field. Also, I'm not interested in scoring results
> based on frequency or weights; I need to retrieve all of the results
> anyway.
>
> Any help with this would be great.
>
> Thanks,
> Chris McGee
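
(For reference, the IndexReader.termDocs() approach mentioned above, which
walks the postings of a single term with no scoring involved, looks roughly
like the following sketch; the index path, field name and value are placeholders.)

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

IndexReader reader = IndexReader.open("/path/to/index");
TermDocs td = reader.termDocs(new Term("name", "someValue"));
try {
    while (td.next()) {
        int docId = td.doc();   // matching document number, in index order
        // process docId here
    }
} finally {
    td.close();
}
reader.close();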
