lucene-general mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Performance problem
Date Wed, 24 Aug 2005 15:58:40 GMT
On Wednesday 24 August 2005 09:32, Wolfgang Täger wrote:
> Dear all,
> 
> we are using Lucene to store 10 million bilingual sentence pairs for
> natural language processing. Each document contains a sentence,
> its translation and a topical code. We want to select sentences containing
> certain words and compute statistics over the topical codes in order to detect
> translations which depend on the topic (e.g. key => Taste (topic: input
> devices), key => Schlüssel (topic: cryptography)).
> 
> While the search is carried out in a reasonably short time (about 
> 500..800ms) we have a performance problem with actually retrieving the 
> documents by code like:
> 
> for (int i = nrhits-1; i >=0; i--){
>         Document hitDoc = hits.doc(i);
>         String code=hitDoc.get("code");
>         // ... statistics
> }
>  
> Even when restricting nrhits to 2000, we have to wait 10..20 seconds just
> for the retrieval. Since the documents are so short, we would have expected
> quicker retrieval. By the way, the loop runs in reverse order in the hope of
> accelerating the retrieval.
> 
> We are using Lucene 1.4.3 Java version on a Windows PC.
>  
> Would you recommend using the C version? I suppose it is stable and we
> can reuse the database? Any other suggestions?

For that much retrieval, it's better to roll your own:
Use the low-level search API Searcher.search(Query, HitCollector) to collect
all the hits by doc number, keeping the scores if you need them.
Then sort these doc numbers (they are normally not far from sorted after
collecting), and retrieve all docs in that sorted order with
IndexReader.document(int).
That way, with a bit of luck, the disk head never needs to change direction
during retrieval, and prefetches by the operating system (if any) stand a much
better chance of actually being used.
If you don't have the index reader around, open it explicitly
and construct your searcher from it.
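
A rough sketch of the collect, sort, then fetch pattern, in case it helps.
To keep it self-contained it does not depend on Lucene: DocIdBuffer,
countCodes and the codeOf lookup are hypothetical names made up for this
illustration, and it uses current Java syntax rather than what a 1.4.3-era
JVM accepts. In real code, add() would be called from
HitCollector.collect(int doc, float score), and the lookup would be
something like reader.document(doc).get("code").

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.IntFunction;

public class SortedDocRetrieval {

    /** Growable buffer for the doc numbers handed to the hit collector. */
    public static final class DocIdBuffer {
        private int[] ids = new int[1024];
        private int size = 0;

        /** Call this from the collector callback for every hit. */
        public void add(int doc) {
            if (size == ids.length) {
                ids = Arrays.copyOf(ids, size * 2);
            }
            ids[size++] = doc;
        }

        /** Returns the collected doc numbers in ascending order. */
        public int[] toSortedArray() {
            int[] out = Arrays.copyOf(ids, size);
            Arrays.sort(out);
            return out;
        }
    }

    /**
     * Visits the docs in ascending doc-number order and tallies their codes.
     * codeOf stands in for the stored-field lookup (assumption: with Lucene
     * this would read the "code" field via IndexReader.document(int)).
     */
    public static Map<String, Integer> countCodes(int[] sortedDocs,
                                                  IntFunction<String> codeOf) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int doc : sortedDocs) {
            counts.merge(codeOf.apply(doc), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        DocIdBuffer hits = new DocIdBuffer();
        // Hits arrive in whatever order the search produces them.
        hits.add(42);
        hits.add(7);
        hits.add(19);

        // Stand-in lookup; a real one would read the stored field from the index.
        IntFunction<String> fakeCode = doc -> (doc % 2 == 0) ? "crypto" : "input";

        System.out.println(countCodes(hits.toSortedArray(), fakeCode));
    }
}
```

The point of sorting before fetching is only the access order on disk; the
statistics themselves do not depend on it. If you need the scores as well,
you would have to keep them paired with the doc numbers while sorting.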

Regards,
Paul Elschot

