lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Solr cloud performance degradation with billions of documents
Date Wed, 13 Aug 2014 21:17:27 GMT
Could you clarify what you mean with the term "cloud", as in "per cloud" and 
"individual clouds"? That's not a proper Solr or SolrCloud concept per se. 
SolrCloud works with a single "cluster" of nodes. And there is no 
interaction between separate SolrCloud clusters.

-- Jack Krupansky

-----Original Message----- 
From: Wilburn, Scott
Sent: Wednesday, August 13, 2014 5:08 PM
To: solr-user@lucene.apache.org
Subject: Solr cloud performance degradation with billions of documents

Hello everyone,
I am trying to use SolrCloud to index a very large number of simple 
documents and have run into some performance and scalability limitations and 
was wondering what can be done about it.

Hardware wise, I have a 32-node Hadoop cluster that I use to run all of the 
Solr shards and each node has 128GB of memory. The current SolrCloud setup 
is split into 4 separate and individual clouds of 32 shards each thereby 
giving four running shards per cloud or one cloud per eight nodes. Each 
shard is currently assigned a 6GB heap size. I’d prefer to avoid increasing 
heap memory for Solr shards to have enough to run other MapReduce jobs on 
the cluster.

The rate of documents that I am currently inserting into these clouds per 
day is 5 Billion each in two clouds, 3 Billion into the third, and 2 Billion 
into the fourth ; however to account for capacity, the aim is to scale the 
solution to support double that amount of documents. To index these 
documents, there are MapReduce jobs that run that generate the Solr XML 
documents and will then submit these documents via SolrJ's CloudSolrServer 
interface. In testing, I have found that limiting the number of active 
parallel inserts to 80 per cloud gave the best performance as anything 
higher gave diminishing returns, most likely due to the constant shuffling 
of documents internally to SolrCloud. From an index perspective, dated 
collections are being created to hold an entire day's of documents and 
generally the inserting happens primarily on the current day (the previous 
days are only to allow for searching) and the plan is to keep up to 60 days 
(or collections) in each cloud. A single shard index in one collection in 
the busiest cloud currently takes up 30G disk space or 960G for the entire 
collection. The documents are being auto committed with a hard commit time 
of 4 minutes (opensearcher = false) and soft commit time of 8 minutes.

>From a search perspective, the use case is fairly generic and simple 
searches of the type :, so there is no need to tune the system to use any of 
the more advanced querying features. Therefore, the most important thing for 
me is to have the indexing performance be able to keep up with the rate of 
input.

In the initial load testing, I was able to achieve a projected indexing rate 
of 10 Billion documents per cloud per day for a grand total of 40 Billion 
per day. However, the initial load testing was done on fairly empty clouds 
with just a few small collections. Now that there have been several days of 
documents being indexed, I am starting to see a fairly steep drop-off in 
indexing performance once the clouds reached about 15 full collections (or 
about 80-100 Billion documents per cloud) in the two biggest clouds. Based 
on current application logging I’m seeing a 40% drop off in indexing 
performance. Because of this, I have concerns on how performance will hold 
as more collections are added.

My question to the community is if anyone else has had any experience in 
using Solr at this scale (hundreds of Billions) and if anyone has observed 
such a decline in indexing performance as the number of collections 
increases. My understanding is that each collection is a separate index and 
therefore the inserting rate should remain constant. Aside from that, what 
other tweaks or changes can be done in the SolrCloud configuration to 
increase the rate of indexing performance? Am I hitting a hard limitation of 
what Solr can handle?

Thanks,
Scott


Mime
View raw message