lucene-solr-user mailing list archives

From "Burton-West, Tom" <>
Subject RE: Can Apache Solr Handle TeraByte Large Data
Date Mon, 16 Jan 2012 22:00:35 GMT
Hello,

Real-time search sounds difficult with that amount of data. With 3 million large documents
and 5TB of data, the index will be very large. With an index that large, your performance
will probably be I/O bound.

Do you plan on allowing phrase or proximity searches? If so, your performance will be even
more I/O bound, as documents that large will have huge positions indexes that need to
be read into memory to process phrase queries. To reduce I/O, you need as much of the index
as possible in memory (in the Lucene/Solr caches and the operating system disk cache). Every
commit invalidates the Solr/Lucene caches (unless the newer NRT code has solved this for Solr).

If you index and serve on the same server, you are also going to get terrible response time
whenever your commits trigger a large merge.

If you need to service 10-100 qps or more, you may need to look at putting your index on SSDs
or spreading it over enough machines so it can stay in memory.
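As a rough illustration of "enough machines so it can stay in memory", here is a back-of-the-envelope sketch. The function name and all of the numbers in the usage line are hypothetical, not measurements from any real system; it only divides an index size by the RAM left over for the OS disk cache on each machine.

```python
import math

def machines_needed(index_size_gb, ram_per_machine_gb, reserved_gb):
    """Estimate how many machines are needed to hold the whole index in RAM.

    reserved_gb is the memory assumed to be taken by the JVM heap, the OS,
    and other processes, i.e. memory NOT available as disk cache.
    All inputs are assumptions supplied by the caller.
    """
    usable_gb = ram_per_machine_gb - reserved_gb
    if usable_gb <= 0:
        raise ValueError("no memory left over for caching the index")
    return math.ceil(index_size_gb / usable_gb)

# Hypothetical example: a 1000 GB index, 72 GB machines, 8 GB reserved each.
print(machines_needed(1000, 72, 8))
```

In practice only the hot parts of the index need to be cached, so an estimate like this is an upper bound rather than a requirement.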

What kind of response times are you looking for and what query rate?

We have somewhat smaller documents. We have 10 million documents and about 6-8TB of data in
HathiTrust and have spread the index over 12 shards on 4 machines (i.e. 3 shards per machine).
We get an average response time of around 200-300ms, but our 95th percentile times are about
800ms and our 99th percentile times are around 2 seconds. This is with an average load of less
than 1 query/second.
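For anyone who wants to report the same kind of numbers from their own query logs, a minimal sketch of computing percentiles follows. It uses the nearest-rank method, which is an assumption on my part and not necessarily how the figures above were calculated; the function name and the sample data are made up for illustration.

```python
import math

def percentile(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method.

    pct is a whole number from 1 to 100. samples would typically be
    response times in milliseconds pulled from a query log.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Made-up response times (ms), purely for illustration.
times_ms = [120, 180, 210, 250, 260, 300, 340, 800, 950, 2100]
print(percentile(times_ms, 50), percentile(times_ms, 95), percentile(times_ms, 99))
```

Averages hide the slow tail, which is why the 95th and 99th percentiles are worth tracking separately.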

As Otis suggested, you may want to implement a strategy that allows users to search within
the large documents by breaking the documents up into smaller units. What we do is have two
Solr indexes. The first indexes complete documents. When the user clicks on a result, we
index the entire document at the page level in a small Solr index on-the-fly. That way they
can search within the document and get page-level results.
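The page-level step can be sketched roughly as below. This is not our actual code: the field names (id, doc_id, page, text) and the id scheme are hypothetical, and the sketch only builds the per-page documents and serializes them as JSON. How the payload is posted to the small index depends on your Solr version and schema.

```python
import json

def build_page_docs(doc_id, pages):
    """Turn one large document into one small Solr document per page.

    pages is a list of page texts in reading order. The field names used
    here are illustrative and would need to match your own schema.
    """
    docs = []
    for number, text in enumerate(pages, start=1):
        docs.append({
            "id": f"{doc_id}_p{number}",  # unique key per page (hypothetical scheme)
            "doc_id": doc_id,             # lets results be tied back to the volume
            "page": number,
            "text": text,
        })
    return docs

# Illustrative usage with a two-page document.
page_docs = build_page_docs("vol1", ["first page text", "second page text"])
payload = json.dumps(page_docs)
print(len(page_docs), page_docs[1]["id"])
```

Keeping the parent document id on every page document makes it easy to delete or replace all pages of a volume in the small index later.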
More details about our setup:

Tom Burton-West
University of Michigan Library