lucene-solr-user mailing list archives

From Shawn Heisey <apa...@elyograg.org>
Subject Re: Indexing throughput
Date Wed, 02 May 2018 23:07:29 GMT
On 5/2/2018 10:58 AM, Greenhorn Techie wrote:
> The current hardware profile for our production cluster is 20 nodes, each
> with 24cores and 256GB memory. Data being indexed is very structured in
> nature and is about 30 columns or so, out of which half of them are
> categorical with a defined list of values. The expected peak indexing
> throughput is to be about *50000* documents per second (expected to be done
> at off-peak hours so that search requests will be minimal during this time)
> and the average throughput around *10000* documents (normal business
> hours).
>
> Given the hardware profile, is it realistic and practical to achieve the
> desired throughput? What factors affect the performance of indexing apart
> from the above hardware characteristics? I understand that its very
> difficult to provide any guidance unless a prototype is done. But wondering
> what are the considerations and dependencies we need to be aware of and
> whether our throughput expectations are realistic or not.

50000 docs per second is not a slow indexing rate.  It has been
achieved, and as Erick noted, surpassed by a very large margin.  Whether
you can get there with your planned hardware on your index is not a
question that I can answer.  If I had to guess, I think that as long as
the source system can push the data that fast, it SHOULD be possible to
create an indexing system that can do it.

The important thing to do for fast indexing with Solr is to have a lot
of threads or processes indexing all at the same time.  Indexing with a
single thread will not achieve the fastest possible performance.
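
As a rough illustration of the point above, here is a minimal Java sketch
that splits documents into batches and sends them from a fixed pool of
threads.  The names ParallelIndexer, BatchSender and indexAll are
hypothetical, as are the batch size and thread count in the usage note.
The sender is a placeholder for whatever actually posts a batch to Solr
(with SolrJ, it might wrap a client's add call), and documents are plain
strings here only to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

final class ParallelIndexer {

    /** Hypothetical hook: sends one batch of documents to Solr. */
    interface BatchSender {
        void send(List<String> batch) throws Exception;
    }

    /**
     * Splits docs into batches of batchSize and sends them concurrently
     * from a fixed pool of worker threads. Returns the number of
     * documents handed to the sender. Fails on the first batch error.
     */
    static long indexAll(List<String> docs, int batchSize, int threads,
                         BatchSender sender) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong sent = new AtomicLong();
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (int start = 0; start < docs.size(); start += batchSize) {
                int end = Math.min(start + batchSize, docs.size());
                List<String> batch = docs.subList(start, end);
                // Each batch becomes one task; up to `threads` run at once.
                pending.add(pool.submit(() -> {
                    sender.send(batch);
                    sent.addAndGet(batch.size());
                    return null;
                }));
            }
            // Wait for every batch; get() rethrows any sender failure.
            for (Future<?> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return sent.get();
    }
}
```

A call such as ParallelIndexer.indexAll(docs, 1000, 8, sender) would send
batches of 1000 documents from 8 threads.  Those numbers are placeholders,
not recommendations; the right values depend on the documents and the
hardware and are best found by testing.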

Since you're planning SolrCloud, you should put some effort into making
your indexing system aware of the cluster state and the shard routing,
so that it can send indexing requests directly to shard leaders.
Indexing is faster if Solr doesn't need to forward requests.  The SolrJ
client named "CloudSolrClient" is always aware of the cluster state, so
if you can use it, updates can always be sent directly to the leaders.

Thanks,
Shawn

