lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <DCutt...@grandcentral.com>
Subject RE: CachingDirectory contribution
Date Mon, 08 Oct 2001 19:21:05 GMT
> From: nelson@monkey.org [mailto:nelson@monkey.org]
> 
> >Performance should be more-or-less linear: a two-million 
> document index will
> >be almost twice as slow to search as a one-million document index.
> 
> Is performance linear in the number of documents? I would naively
> think it would be linear in the number of terms. Also, how does
> document size change things? Is searching 1 million 100k documents
> twice as fast as 1 million 200k documents?

Some more precise statements: The cost to search for a term is proportional
to the number of documents that contain that term.  The cost to search for a
phrase is proportional to the sum of the number of occurrences of its
constituent terms.  The cost to execute a boolean query is the sum of the
costs of its sub-queries.  Longer documents contain more terms: usually both
more unique terms and more occurrences.

Total vocabulary size is not a big factor in search performance.  When you
open an index Lucene does read one out of every 128 unique terms into a
table, so an index with a large number of unique terms will be slower to
open.  Searching that table for query terms is also slower for bigger
indexes, but the time to search that table is not significant in overall
performance.  Lucene also reads at index open one byte per document per
indexed field (the normalization factor).  So an index with lots of
documents and fields will also be slower to open.  But, once opened, the
cost of searching is largely dependent on the frequency characteristics of
query terms.  And, since IndexReaders and Searchers are thread safe, you
don't need to open indexes very often.

Doug

Mime
View raw message