lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Soeren Pekrul <soeren.pek...@gmx.de>
Subject Re: Lucene search performance: linear?
Date Tue, 05 Dec 2006 18:37:10 GMT
Hello Lisheng,

a search process has to do usually two thinks. First it has to find the 
term in the index. I don’t know the implementation of finding a term in 
Lucene. I hope that the index is at least a sorted list or a binary 
tree, so it can search binary. The time finding a term depends of the 
term’s number n_t. If it searches binary the complexity is approximately 
log(n_t). The search time should be better then linear.

Second it has to collect the documents for a term. This depends of the 
documents number n_d for a term. It has to go thru the list of documents 
for a term. The time should be proportional to the number of documents 
for a term even if it doesn’t calculate the similarity. Usually the 
number of documents for a single term is less than the total number of 
documents in the collection and less than the total number of terms in 
the index.

If the number of documents for a single term is less than the total 
number of documents the search process for a single term including 
process one (finding the term) and process two (collecting the documents 
and calculating the score) should be better the linear to the number of 
documents.

> I indexed first 220,000, all with a special keyword, I did a simple
> query and only fetched 5 docs, with Hits.length()=220,000.
> 
> Then I indexed 440,000 docs, with the same keyword, query it
> again and fetched a few docs, with Hits.length(0=440,000.

In your case the query term is contained in all documents. The number of 
documents for a single term is equals the total number of documents in 
your collection. The hit collector has to collect all documents. The 
collecting process is proportional to the number of documents to 
collect. So the search for all documents should be at least linear to 
the total number of documents.

Sören

Zhang, Lisheng schrieb:
> Hi,
> 
> I indexed first 220,000, all with a special keyword, I did a simple
> query and only fetched 5 docs, with Hits.length()=220,000.
> 
> Then I indexed 440,000 docs, with the same keyword, query it
> again and fetched a few docs, with Hits.length(0=440,000.
> 
> I found that search time is about linear: 2nd time is about 2 times
> longer than 1st query. I would like to understand:
> 
> Does the linear relation come from score calculation, since we
> have to calculate score one by one? Or other reason?
> 
> If we have B-tree index I would naively expect a better scalibility?
> 
> Thanks very much for your helps, 
> 
> Lisheng

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message