lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Greenhorn Techie <greenhorntec...@gmail.com>
Subject Re: Indexing throughput
Date Wed, 02 May 2018 20:45:50 GMT
Thanks Walter and Erick for the valuable suggestions. We shall try out
various values for shards and as well other tuning metrics I discussed in
various threads earlier.

Kind Regards


On 2 May 2018 at 18:24:31, Erick Erickson (erickerickson@gmail.com) wrote:

I've seen 1.5 M docs/second. Basically the indexing throughput is gated
by two things:
1> the number of shards. Indexing throughput essentially scales up
reasonably linearly with the number of shards.
2> the indexing program that pushes data to Solr. Before thinking Solr
is the bottleneck, check how fast your ETL process is pushing docs.

This pre-supposes using SolrJ and CloudSolrClient for the final push
to Solr. This pre-buckets the updates and sends the updates for each
shard to the shard leader, thus reducing the amount of work Solr has
to do. If you use SolrJ, you can easily do <2> above by just
commenting out the single call that pushes the docs to Solr in your
program.

Speaking of which, it's definitely best to batch the updates, see:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/

Best,
Erick

On Wed, May 2, 2018 at 10:07 AM, Walter Underwood <wunder@wunderwood.org>
wrote:
> We have a similar sized cluster, 32 nodes with 36 processors and 60 Gb
RAM each
> (EC2 C4.8xlarge). The collection is 24 million documents with four
shards. The cluster
> is Solr 6.6.2. All storage is SSD EBS.
>
> We built a simple batch loader in Java. We get about one million
documents per minute
> with 64 threads. We do not use the cloud-smart SolrJ client. We just send
all the
> batches to the load balancer and let Solr sort it out.
>
> You are looking for 3 million documents per minute. You will just have to
test that.
>
> I haven’t tested it, but indexing should speed up linearly with the
number of shards,
> because those are indexing in parallel.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On May 2, 2018, at 9:58 AM, Greenhorn Techie <greenhorntechie@gmail.com>
wrote:
>>
>> Hi,
>>
>> The current hardware profile for our production cluster is 20 nodes,
each
>> with 24cores and 256GB memory. Data being indexed is very structured in
>> nature and is about 30 columns or so, out of which half of them are
>> categorical with a defined list of values. The expected peak indexing
>> throughput is to be about *50000* documents per second (expected to be
done
>> at off-peak hours so that search requests will be minimal during this
time)
>> and the average throughput around *10000* documents (normal business
>> hours).
>>
>> Given the hardware profile, is it realistic and practical to achieve the
>> desired throughput? What factors affect the performance of indexing
apart
>> from the above hardware characteristics? I understand that its very
>> difficult to provide any guidance unless a prototype is done. But
wondering
>> what are the considerations and dependencies we need to be aware of and
>> whether our throughput expectations are realistic or not.
>>
>> Thanks
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message