Mailing-List: contact dev-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@couchdb.apache.org
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 9.2 \(3112\))
Subject: Re: Multiple database backup strategy
From: Robert Samuel Newson <rnewson@apache.org>
In-Reply-To: <E9847E23-B889-48C3-B670-2571599E52E0@apache.org>
Date: Sat, 19 Mar 2016 18:36:13 +0000
Content-Transfer-Encoding: quoted-printable
Message-Id: <2B92BD0A-427D-4C0D-A842-2368C35DC5D2@apache.org>
References: 
 <CAMKMHYjGyNXHd-oHi9LyvoKZCZcnQzrPQ0GpeGbVyypZzTg=Yw@mail.gmail.com>
 <139A3579-EDE8-4DE3-B432-860C13B553EA@apache.org>
 <E9847E23-B889-48C3-B670-2571599E52E0@apache.org>
To: dev@couchdb.apache.org

Hi,

The problem is that _db_updates is not guaranteed to see every update,
so I think it falls at the first hurdle.

What couch_replicator_manager does in CouchDB 2.0 (though not in the
version that Cloudant originally contributed) is to use couch_event,
notice which updates are to _replicator shards, and trigger management
work from that.

Some work I'm embarking on, with a few other devs here at Cloudant, is
to enhance the replicator manager so that it does not run all jobs at
once. It is indeed the plan to have each of those jobs run for a while,
kill them (they checkpoint, then close all resources) and reschedule
them later. It's TBD whether we'd always strip feed=continuous from
those. We _could_ let each job run to completion (i.e., caught up to the
source db as of the start of the replication job), but I think we have
to be a bit smarter and allow replication jobs that constantly have work
to do (i.e., the source db is always busy) to run as they run today,
with feed=continuous, unless forcibly ousted by a scheduler due to some
concurrency setting in the configuration.
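
To make the "run for a while, checkpoint, reschedule" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the planned CouchDB implementation: `Job`, `run_slice`, `schedule`, `max_running` and `budget` are hypothetical names, jobs are simulated one after another rather than concurrently, and the case of an always-busy source that keeps running with feed=continuous is not modelled.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical stand-in for one replication job."""
    name: str
    pending: int          # changes still to copy from the source db
    checkpoints: int = 0  # how many times the job was stopped and checkpointed

    def run_slice(self, budget: int) -> None:
        # Copy up to `budget` changes, then checkpoint so a later slice can resume.
        self.pending -= min(budget, self.pending)
        self.checkpoints += 1


def schedule(jobs, max_running, budget):
    """Round-robin: run at most `max_running` jobs per round, requeue unfinished ones."""
    queue = deque(jobs)
    rounds = 0
    while queue:
        running = [queue.popleft() for _ in range(min(max_running, len(queue)))]
        for job in running:
            job.run_slice(budget)
            if job.pending > 0:
                queue.append(job)  # still behind the source: reschedule for later
        rounds += 1
    return rounds


jobs = [Job("db_a", 5), Job("db_b", 12), Job("db_c", 1)]
rounds = schedule(jobs, max_running=2, budget=5)
```

With these made-up numbers, db_b needs three slices to catch up, while db_a and db_c each finish in one and free their slot for the next round.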

I note for completeness that the work we're planning explicitly
includes "multi database" strategies; you'll hopefully be able to make a
single _replicator doc that represents your entire intention (e.g.,
"replicate _all_ dbs from server1 to server2").
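
Until such a single-doc form exists, the same intention can be approximated by expanding it into one _replicator document per database. The Python sketch below is an assumption-laden illustration: `expand_intent`, the `backup-` id prefix and the server URLs are hypothetical, and skipping underscore-prefixed databases is my own choice. Only the `source`, `target` and `create_target` fields are standard _replicator document fields.

```python
from urllib.parse import quote


def expand_intent(source_server, target_server, db_names):
    """Build one per-database _replicator document for every user database."""
    docs = []
    for name in sorted(db_names):
        if name.startswith("_"):
            continue  # assumption: skip system dbs such as _users and _replicator
        encoded = quote(name, safe="")  # database names may contain "/"
        docs.append({
            "_id": f"backup-{name}",  # hypothetical naming scheme
            "source": f"{source_server}/{encoded}",
            "target": f"{target_server}/{encoded}",
            "create_target": True,
        })
    return docs


docs = expand_intent(
    "http://server1:5984",
    "http://server2:5984",
    ["orders", "_users", "a/b"],
)
```

Each resulting dict would then be written to the _replicator database; the list has to be regenerated whenever databases are added or deleted, which is exactly the bookkeeping a single "replicate all" doc would remove.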

B.


> On 14 Mar 2016, at 02:40, Adam Kocoloski <kocolosk@apache.org> wrote:
>
>
>> On Mar 10, 2016, at 3:18 AM, Jan Lehnardt <jan@apache.org> wrote:
>>
>>>
>>> On 09 Mar 2016, at 21:29, Nick Wood <nwood888@gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> I'm looking to back up a CouchDB server with multiple databases. Currently
>>> 1,400, but it fluctuates up and down throughout the day as new databases
>>> are added and old ones deleted. ~10% of the databases are written to within
>>> any 5 minute period of time.
>>>
>>> Goals
>>> - Maintain a continual off-site snapshot of all databases, preferably no
>>> older than a few seconds (or minutes)
>>> - Be efficient with bandwidth (i.e. not copy the whole database file for
>>> every backup run)
>>>
>>> My current solution watches the global _changes feed and fires up a
>>> continuous replication to an off-site server whenever it sees a change. If
>>> it doesn't see a change from a database for 10 minutes, it kills that
>>> replication. This means I only have ~150 active replications running on
>>> average at any given time.
>>
>> How about instead of using continuous replications and killing them,
>> use non-continuous replications based on _db_updates? They end
>> automatically and should use fewer resources then.
>>
>> Best
>> Jan
>> --
>
> In my opinion this is actually a design we should adopt for CouchDB’s own
> replication manager. Keeping all those _changes listeners running is
> needlessly expensive now that we have _db_updates.
>
> Adam