lucene-java-user mailing list archives

From "Burton-West, Tom" <tburt...@umich.edu>
Subject RE: scalability limit in terms of numbers of large documents
Date Mon, 16 Aug 2010 16:00:19 GMT

Hi Andy,

We are currently indexing about 650,000 full-text books per Solr/Lucene index. We have
10 shards, for a total of about 6.5 million documents. Our average response time is under
2 seconds, but the slowest 1% of queries take between 5 and 30 seconds. If you were searching
only one index of 650,000 documents instead of the 6.5 million, the response time would be
quite a bit better. If you only allow boolean "AND" queries and use stopwords, the response
time would be significantly better. Our slowest searches are almost all phrase queries with
common words.
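
To illustrate why I quote both an average and the slowest 1%, here is a minimal sketch
in Python. The function name and the sample latencies are made up for illustration; they are
not our measurements. It shows how a mean under 1 second can coexist with a 99th percentile
of 20 seconds.

```python
import statistics


def summarize_latencies(latencies_s):
    """Return mean, 99th-percentile, and max latency in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    n = len(ordered)
    # Nearest-rank percentile: the smallest sample with at least 99% of
    # all samples at or below it. Integer math avoids float rounding.
    rank = (99 * n + 99) // 100  # ceil(0.99 * n)
    return {
        "mean": statistics.fmean(ordered),
        "p99": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical workload: 980 fast queries and 20 slow phrase queries.
sample = [0.5] * 980 + [20.0] * 20
summary = summarize_latencies(sample)
print(summary)
```

If you only track the mean, the slow phrase queries are invisible, so it helps to decide up
front which percentile "searched quickly" has to hold for.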

You probably need to define what you mean by "searched quickly" and what kind of load you
are expecting. You also need to think about what kind of hardware you want to use. As
index sizes get large, disk I/O can become a bottleneck. Using more memory for the OS disk
cache and the Solr/Lucene caches can compensate for this. Using SSDs instead of hard disks
can also offset it, as Toke can tell you. Frequent index updates can invalidate both the
OS disk cache and the Solr/Lucene caches, so there are lots of trade-offs to tune.

Lucene had a limit of about 2.4 billion unique terms per segment, which we ran into because
we have dirty OCR and 200 languages (http://www.hathitrust.org/blogs/large-scale-search/too-many-words).
However, Michael McCandless changed the limit to about 274 billion unique terms. Chances
are you will run into disk I/O or other bottlenecks long before you reach this limit.

BTW: we index whole books as Solr documents, not chapters or pages.

Tom 
www.hathitrust.org/blogs