couchdb-dev mailing list archives

From Adam Kocoloski <kocol...@apache.org>
Subject Re: Cluster Replication batch_size and batch_count Modification
Date Mon, 05 Jun 2017 23:05:17 GMT
The answer to your clarifying question is absolutely yes. The “pending_changes” metric
refers to the number of committed changes on the shard replica emitting the log event that
need to be cross-checked on another replica. It’s not a measure of writes that need to be
executed.

Cheers, Adam

> On Jun 5, 2017, at 4:37 PM, Phil May <phil.may@motorolasolutions.com> wrote:
> 
> Hi Adam,
> 
> Thanks for the info!
> 
> When we run at high write rates, we will start to fall behind, but when we
> reduce the rate, we eventually catch up.
> 
> I have a clarification question – can the warning messages we are seeing
> still occur in a healthy cluster due to the "redundant cross-check" taking
> long enough that more changes have accumulated that now also need to be
> cross-checked (even when no actual writes were needed)?
> 
> We have had some luck modifying sync_concurrency (which is exposed in the
> .ini file) and batch_size (which we exposed), and that does give us more
> throughput capacity.
> 
> Thanks!
> 
> - Phil
> 
> 
> On Mon, Jun 5, 2017 at 11:38 AM, Adam Kocoloski <kocolosk@apache.org> wrote:
> 
>> Hi Phil,
>> 
>> Here’s the thing to keep in mind about those warning messages: in a
>> healthy cluster, the internal replication traffic that generates them is
>> really just a redundant cross-check. It exists to “heal” a cluster member
>> that was down during some write operations. When you write data into a
>> CouchDB cluster, the copies are written to all relevant shard replicas
>> proactively.
>> 
>> If your cluster’s steady-state write load is causing internal cluster
>> replication to fall behind permanently, that’s problematic. You should tune
>> the cluster replication parameters to give it more throughput. If the
>> replication is only falling behind during some batch data load and then
>> catches up later, it may be a different story. You may want to keep things
>> configured as-is.
>> 
>> Does that make sense?
>> 
>> Cheers, Adam
>> 
>>> On Jun 4, 2017, at 11:06 PM, Phil May <phil.may@motorolasolutions.com> wrote:
>>> 
>>> I'm writing to check whether modifying the replication batch_count and
>>> batch_size parameters for cluster replication is a good idea.
>>> 
>>> Some background – our data platform dev team noticed that under heavy
>>> write load, cluster replication was falling behind. The following
>>> warning messages started appearing in the logs, and the pending_changes
>>> value consistently increased while under load.
>>> 
>>> [warning] 2017-05-18T20:15:22.320498Z couch-1@couch-1.couchdb <0.316.0>
>>> -------- mem3_sync shards/a0000000-bfffffff/test.1495137986
>>> couch-3@couch-3.couchdb
>>> {pending_changes,474}
>>> 
>>> What we saw is described in COUCHDB-3421
>>> <https://issues.apache.org/jira/browse/COUCHDB-3421>. In addition,
>>> CouchDB appears to be CPU bound while this is occurring, not I/O bound
>>> as would seem reasonable to expect for replication.
>>> 
>>> When we looked into this, we found two values in the source that affect
>>> replication: batch_size and batch_count. For cluster replication, these
>>> values are fixed at 100 and 1 respectively, so we made them configurable.
>>> We tried various values, and it seems that increasing batch_size (and, to
>>> a lesser extent, batch_count) improves our write performance. As a point
>>> of reference, with batch_count=50 and batch_size=5000 we can handle about
>>> double the write throughput with no warnings. We are experimenting with
>>> other values.
>>> 
>>> We wanted to know if adjusting these parameters is a sound approach.
>>> 
>>> Thanks!
>>> 
>>> - Phil
>> 
>> 
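
The thread distinguishes a pending_changes backlog that keeps growing under steady-state load from one that builds up during a burst and later drains. A rough way to tell the two apart from the logs is sketched below in Python. It is not part of CouchDB. It assumes each mem3_sync warning occupies a single log line in the form quoted in the thread (the quoted example is wrapped by the mail client), and the names LINE_RE, parse_pending and classify are hypothetical helpers introduced for illustration. The second sample line is invented for the demo.

```python
import re
from collections import defaultdict

# Assumption: the warning is logged on one line, laid out like the example
# quoted in the thread. Adjust the pattern if your log format differs.
LINE_RE = re.compile(
    r"\[warning\]\s+(?P<ts>\S+)\s+(?P<node>\S+)\s+<[^>]+>\s+-+\s+"
    r"mem3_sync\s+(?P<shard>\S+)\s+(?P<target>\S+)\s+"
    r"\{pending_changes,(?P<pending>\d+)\}"
)


def parse_pending(lines):
    """Collect (timestamp, pending_changes) samples per (shard, target node).

    Samples are kept in the order the lines are read; non-matching lines
    are ignored.
    """
    series = defaultdict(list)
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            key = (m["shard"], m["target"])
            series[key].append((m["ts"], int(m["pending"])))
    return dict(series)


def classify(samples):
    """Crude heuristic: compare the last pending_changes value to the first."""
    first, last = samples[0][1], samples[-1][1]
    if last > first:
        return "growing"
    if last < first:
        return "draining"
    return "flat"


if __name__ == "__main__":
    sample_log = [
        # Taken from the thread, joined onto one line.
        "[warning] 2017-05-18T20:15:22.320498Z couch-1@couch-1.couchdb "
        "<0.316.0> -------- mem3_sync shards/a0000000-bfffffff/test.1495137986 "
        "couch-3@couch-3.couchdb {pending_changes,474}",
        # Hypothetical later sample, made up for this demo.
        "[warning] 2017-05-18T20:16:22.000000Z couch-1@couch-1.couchdb "
        "<0.316.0> -------- mem3_sync shards/a0000000-bfffffff/test.1495137986 "
        "couch-3@couch-3.couchdb {pending_changes,120}",
    ]
    for key, samples in parse_pending(sample_log).items():
        print(key, classify(samples), [pending for _, pending in samples])
```

Comparing only the first and last sample is deliberately simple. A backlog that stays "growing" across long windows of normal load matches the case where tuning is suggested in the thread, while one that turns "draining" after a bulk load may be fine as configured.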

