lucene-dev mailing list archives

From "ASF subversion and git services (JIRA)" <>
Subject [jira] [Commented] (SOLR-9936) Allow configuration for recoveryExecutor thread pool size
Date Fri, 14 Apr 2017 05:49:42 GMT


ASF subversion and git services commented on SOLR-9936:

Commit bc6ff493b09a1ec5454c5ce790f6b7ecb714743e in lucene-solr's branch refs/heads/master
from markrmiller

SOLR-9936: Allow configuration for recoveryExecutor thread pool size.

> Allow configuration for recoveryExecutor thread pool size
> ---------------------------------------------------------
>                 Key: SOLR-9936
>                 URL:
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: replication (java)
>    Affects Versions: 6.3
>            Reporter: Tim Owen
>         Attachments: SOLR-9936.patch, SOLR-9936.patch
> There are two executor services in {{UpdateShardHandler}}: the {{updateExecutor}}, whose
size is unbounded for reasons explained in the code comments, and the {{recoveryExecutor}},
which was added later and is the one that executes the {{RecoveryStrategy}} code to actually
fetch index files and store them to disk, eventually calling an {{fsync}} thread to ensure the
data is written.
> We found that with a fast network such as 10GbE it's very easy to overload the local
disk storage when doing a restart of Solr instances after some downtime, if they have many
cores to load. Typically we have each physical server containing 6 SSDs and 6 Solr instances,
so each Solr has its home dir on a dedicated SSD. With 100+ cores (shard replicas) on each
instance, startup can really hammer the SSD as it's writing in parallel from as many cores
as Solr is recovering. This made recovery slow enough that replicas were down for a long
time, and shards were even marked as down when none of their replicas had recovered (usually
when many machines had been restarted). The very slow IO times (10s of seconds or worse) also
made the JVM pause, which caused disconnects from ZK and didn't help recovery either.
> This patch allowed us to throttle how many threads write to a disk in parallel. In
practice we're using a pool size of 4 threads to prevent the SSD from getting overloaded,
and that worked well enough for all cores to recover in a reasonable time.
> Given the code comment on the other thread pool's size, I'd like some feedback on whether
it's OK to do this for the {{recoveryExecutor}}.
> It's configured in solr.xml with e.g.
> {noformat}
>   <updateshardhandler>
>     <int name="maxRecoveryThreads">${solr.recovery.threads:4}</int>
>   </updateshardhandler>
> {noformat}
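
For illustration only, the throttling effect described above can be sketched with a plain fixed-size pool from java.util.concurrent. This is not the code from the patch or from {{UpdateShardHandler}}: the class and method names are hypothetical, the sleep is a stand-in for fetching and fsyncing index files, and reading solr.recovery.threads directly as a system property only approximates the ${solr.recovery.threads:4} substitution in solr.xml.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: shows how a bounded pool caps concurrent "recoveries".
public class BoundedRecoveryPoolSketch {

    // Approximates ${solr.recovery.threads:4}: system property if set, else 4.
    static int configuredPoolSize() {
        return Integer.getInteger("solr.recovery.threads", 4);
    }

    // Submits `tasks` simulated recoveries to a pool of `poolSize` threads and
    // returns the highest number observed running at the same time.
    static int maxObservedConcurrency(int tasks, int poolSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(20); // stand-in for disk-heavy recovery work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            // Wait for every simulated recovery to finish.
            for (Future<?> f : futures) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) throws Exception {
        int size = configuredPoolSize();
        System.out.println("pool size: " + size);
        System.out.println("peak concurrent recoveries: "
                + maxObservedConcurrency(100, size));
    }
}
```

With 100 queued tasks and a pool of 4, the extra tasks wait in the executor's queue instead of all writing at once, which is the behaviour the description relies on to protect the SSD.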

This message was sent by Atlassian JIRA
