couchdb-dev mailing list archives

From Phil May <phil....@motorolasolutions.com>
Subject Re: Cluster Replication batch_size and batch_count Modification
Date Mon, 05 Jun 2017 20:37:08 GMT
Hi Adam,

Thanks for the info!

When we run at high write rates, we will start to fall behind, but when we
reduce the rate, we eventually catch up.

A clarifying question: can the warning messages we are seeing still occur in a
healthy cluster simply because the "redundant cross-check" takes long enough
that more changes accumulate and also need to be cross-checked, even when no
actual writes were needed?

We have had some luck modifying sync_concurrency (which is exposed in the
.ini file) and batch_size (which we exposed), and that does give us more
throughput capacity.
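
For reference, a rough sketch of the kind of settings involved in local.ini
(sync_concurrency should be the stock [mem3] setting; the batch keys are only
illustrative names for the values our patch exposes):

    [mem3]
    ; stock setting: number of concurrent internal replication jobs
    sync_concurrency = 20
    ; illustrative names for the values exposed by our local patch,
    ; not stock CouchDB configuration keys
    sync_batch_size = 5000
    sync_batch_count = 50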

Thanks!

- Phil


On Mon, Jun 5, 2017 at 11:38 AM, Adam Kocoloski <kocolosk@apache.org> wrote:

> Hi Phil,
>
> Here’s the thing to keep in mind about those warning messages: in a
> healthy cluster, the internal replication traffic that generates them is
> really just a redundant cross-check. It exists to “heal” a cluster member
> that was down during some write operations. When you write data into a
> CouchDB cluster the copies are written to all relevant shard replicas
> proactively.
>
> If your cluster’s steady-state write load is causing internal cluster
> replication to fall behind permanently, that’s problematic. You should tune
> the cluster replication parameters to give it more throughput. If the
> replication is only falling behind during some batch data load and then
> catches up later, it may be a different story. You may want to keep things
> configured as-is.
>
> Does that make sense?
>
> Cheers, Adam
>
> > On Jun 4, 2017, at 11:06 PM, Phil May <phil.may@motorolasolutions.com>
> > wrote:
> >
> > I'm writing to check whether modifying the batch_count and batch_size
> > parameters for cluster replication is a good idea.
> >
> > Some background – our data platform dev team noticed that under heavy
> > write load, cluster replication was falling behind. The following warning
> > messages started appearing in the logs, and the pending_changes value
> > consistently increased while under load.
> >
> > [warning] 2017-05-18T20:15:22.320498Z couch-1@couch-1.couchdb <0.316.0>
> > -------- mem3_sync shards/a0000000-bfffffff/test.1495137986
> > couch-3@couch-3.couchdb
> > {pending_changes,474}
> >
> > What we saw is described in COUCHDB-3421
> > <https://issues.apache.org/jira/browse/COUCHDB-3421>. In addition, CouchDB
> > appears to be CPU bound while this is occurring, not I/O bound as one would
> > reasonably expect for replication.
> >
> > When we looked into this, we discovered two values in the source that
> > affect replication: batch_size and batch_count. For cluster replication,
> > these values are fixed at 100 and 1 respectively, so we made them
> > configurable.
> > We tried various values, and it seems increasing batch_size (and, to a
> > lesser extent, batch_count) improves our write performance. As a point of
> > reference, with batch_count=50 and batch_size=5000 we can handle about
> > double the write throughput with no warnings. We are experimenting with
> > other values.
> >
> > We wanted to know if adjusting these parameters is a sound approach.
> >
> > Thanks!
> >
> > - Phil
>
>
