lucene-solr-user mailing list archives

From Joe Obernberger <joseph.obernber...@gmail.com>
Subject Re: Large index recommendation
Date Mon, 16 Jan 2017 16:57:43 GMT
Thank you all for the responses; very helpful!  In this configuration, 
by far, most of the time is spent indexing new data rather than 
searching.  When I look at a Solr instance, it is almost always using 
100% CPU even when no searches are being executed. Given that the machines 
have many cores, going with multiple shards per machine sounds like the 
way to go here.

I do have a test instance, and can do some benchmarking there. Thanks 
again!


-Joe


On 1/13/2017 4:16 PM, Toke Eskildsen wrote:
> Joe Obernberger <joseph.obernberger@gmail.com> wrote:
>
> [3 billion docs / 16TB / 27 shards on HDFS times 3 for replication]
>
>> Each shard is then hosting about 610GBytes of index.  The HDFS cache
>> size is very low at about 8GBytes.  Suffice it to say, performance isn't
>> very good, but again, this is for experimentation.
> We are running a setup with local SSDs where our shards are 900GB with ~6GB
> free for disk cache for each. But the shards are static and fully optimized.
> If your data are non-changing time-series, you might want to consider a model
> with dedicated search-only and shard-build-only nodes to lower hardware
> requirements.
>
>> If we were to redo this, would it be better to create many shards -
>> maybe 200 with 3 replicas each (600 in all) with the goal being to
>> withstand a server going out, and future expansion as more hardware is
>> added?  I know this is very general question.  Thanks very much in advance!
> As Erick says, you are in the fortunate position that it is reasonably easy
> for you to prototype and extrapolate, as you are scaling out. I will add that
> you should keep in mind that under the constraint of constant hardware, you
> might win latency by sharding, but you will lose maximum throughput (due to
> increased duplicate information and increased logistics). If you can just add
> hardware, then do so. If you have under-utilized CPUs and a need for lower
> latency, then try more shards on existing hardware (the Scott@FullStory slides
> that Susheel mentions seem fitting here).
>
> - Toke Eskildsen
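
For comparing the layouts discussed in this thread, a rough back-of-the-envelope
sketch may help. It assumes the total index size is simply 27 shards times the
roughly 610 GB per shard quoted above, and that the index would be spread evenly
across shards; the function name `layout` is made up for illustration, and the
numbers say nothing about actual latency or throughput, which still need
benchmarking.

```python
def layout(total_index_gb, shards, replicas):
    """Return (index GB per shard, total number of replica cores).

    Assumes the index is spread evenly across shards, which is only
    an approximation for a real collection.
    """
    return total_index_gb / shards, shards * replicas


# From the thread: 27 shards hosting about 610 GB of index each.
TOTAL_GB = 27 * 610

# Candidate layout from the thread: 200 shards with 3 replicas each.
per_shard_gb, cores = layout(TOTAL_GB, shards=200, replicas=3)
print(f"200 shards x 3 replicas: ~{per_shard_gb:.0f} GB per shard, {cores} cores")

# The existing layout, for comparison (one Solr replica per shard assumed;
# the "times 3" in the thread refers to HDFS block replication).
per_shard_gb_now, cores_now = layout(TOTAL_GB, shards=27, replicas=1)
print(f"27 shards x 1 replica: ~{per_shard_gb_now:.0f} GB per shard, {cores_now} cores")
```

Smaller shards mean less index per core, but as noted above, more shards on the
same hardware also mean more per-shard overhead, so this only frames what to
measure on the test instance.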

