lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitry Serebrennikov <>
Subject Re: CachingDirectory contribution
Date Mon, 08 Oct 2001 21:12:26 GMT
Doug, what about scaling with the number of concurrent queries? In other 
words, assuming that the queries are similar, how is the search 
performance affected by increase in the number of concurrent users?

Doug Cutting wrote:

>>From: []
>>>Performance should be more-or-less linear: a two-million 
>>document index will
>>>be almost twice as slow to search as a one-million document index.
>>Is performance linear in the number of documents? I would naively
>>think it would be linear in the number of terms. Also, how does
>>document size change things? Is searching 1 million 100k documents
>>twice as fast as 1 million 200k documents?
>Some more precise statements: The cost to search for a term is proportional
>to the number of documents that contain that term.  The cost to search for a
>phrase is proportional to the sum of the number of occurrences of its
>constituent terms.  The cost to execute a boolean query is the sum of the
>costs of its sub-queries.  Longer documents contain more terms: usually both
>more unique terms and more occurrences.
>Total vocabulary size is not a big factor in search performance.  When you
>open an index Lucene does read one out of every 128 unique terms into a
>table, so an index with a large number of unique terms will be slower to
>open.  Searching that table for query terms is also slower for bigger
>indexes, but the time to search that table is not significant in overall
>performance.  Lucene also reads at index open one byte per document per
>indexed field (the normalization factor).  So an index with lots of
>documents and fields will also be slower to open.  But, once opened, the
>cost of searching is largely dependent on the frequency characteristics of
>query terms.  And, since IndexReaders and Searchers are thread safe, you
>don't need to open indexes very often.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message