Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
Received-SPF: pass (athena.apache.org: domain of doug@interactivemediums.com
 designates 209.85.220.180 as permitted sender)
MIME-Version: 1.0
Date: Thu, 28 Apr 2011 11:55:46 -0500
Message-ID: <BANLkTimA0qLPqdEmB-A7R0GmV=MSzo+0zg@mail.gmail.com>
Subject: Replication forgets where it is
From: Doug Barth <doug@interactivemediums.com>
To: user@couchdb.apache.org
Content-Type: text/plain; charset=ISO-8859-1

Hi,

We are currently in the process of migrating a huge CouchDB database
(240GB) to new machines (master and slave). Our plan of attack was to
scp the raw data files over to the new master, then start pull
replication on that new server to grab changes from the originating
server. Once the new master was caught up, we would shut down Couch on
it and scp the files over to the new slave server. We would then start
replication on both servers (originating <- new master <- new slave)
and wait for everyone to catch up. Once caught up, we would schedule a
short downtime to cut writes over to the replacement master and shut
down the originating server.
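
For concreteness, a pull replication of this kind can be started with
a POST to /_replicate on the server doing the pulling. The following
is a minimal sketch only: the host names, port, and database name are
hypothetical placeholders, and build_pull_replication is an
illustrative helper, not part of any CouchDB client library.

```python
import json
import urllib.request


def build_pull_replication(target_host, source_host, db, port=5984):
    """Build (but do not send) a continuous pull-replication request.

    The request is addressed to the pulling server (target_host), which
    reads changes from the remote source and writes to its local database.
    """
    body = {
        "source": f"http://{source_host}:{port}/{db}",  # remote database to pull from
        "target": db,                                    # local database on target_host
        "continuous": True,
    }
    return urllib.request.Request(
        f"http://{target_host}:{port}/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical names for the chain: originating <- new master <- new slave
master_req = build_pull_replication("new-master", "originating", "mydb")
slave_req = build_pull_replication("new-slave", "new-master", "mydb")
# Passing either request to urllib.request.urlopen() would start the replication.
```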

Copying the files over was painless. We copied them from underneath
the running originating server. Once they were over, CouchDB came up
instantly and everything looked good. The replication process, on the
other hand, has been a world of hurt.

First, starting the replication process resulted in Couch spending
about 4 days working its way through the database on disk to figure
out the update sequence that it should start replication from. As I
understand it, this delay makes some sense because the Couch data
file doesn't have a checkpoint noted for that particular combination
of source and target databases. During this period, the disk was doing
a lot of ops (~250 ops/sec), but reading very little data (~1.5MB/sec).
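
To put those numbers in perspective, here is a rough back-of-envelope
calculation. It assumes the scan covers the whole update sequence
(taken from the database stats further down), that "4 days" is exact,
and that 1 MB = 1024 KB; all three are simplifications.

```python
SECONDS_PER_DAY = 86_400

update_seq = 66_185_521   # update seq from the database stats
scan_days = 4             # approximate duration of the scan
disk_ops_per_sec = 250    # approximate, as observed
read_mb_per_sec = 1.5     # approximate, as observed

# Average number of update sequences covered per second of scanning.
seqs_per_sec = update_seq / (scan_days * SECONDS_PER_DAY)

# Average amount of data read per disk operation, in KB.
kb_per_op = read_mb_per_sec * 1024 / disk_ops_per_sec

print(f"{seqs_per_sec:.1f} update sequences per second")
print(f"{kb_per_op:.1f} KB read per disk op")
```

That works out to roughly 190 update sequences per second and about
6 KB per disk operation, which would be consistent with many small
reads rather than sequential streaming.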

Once the new master was caught up, I shut down Couch, and copied the
latest files over to the new slave server. I brought both servers back
up and started replication. I expected the slave server to go through
the 4-day "where am I" process, but the replacement master server also
forgot where it was and started from scratch again.

During the course of trying to speed up this process, I tried
canceling the replication and then bringing down the server (to shut
down a compaction job). Again, when I brought the server back up, it
had forgotten where it was.

If I look in the logs, here is what I see.

[Thu, 28 Apr 2011 15:55:05 GMT] [info] [<0.149.0>] Replication records
differ. Scanning histories to find a common ancestor.

[Thu, 28 Apr 2011 15:55:05 GMT] [info] [<0.149.0>] no common ancestry
-- performing full replication

[Thu, 28 Apr 2011 15:55:05 GMT] [info] [<0.121.0>] starting new
replication "c0872a8cc263f23fa1a47d35c784591b+continuous" at <0.149.0>
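
Read literally, these messages suggest the replicator compares a
checkpoint record kept on the source with one kept on the target, and
falls back to a full replication when the two share no history. Below
is a simplified sketch of that kind of comparison. It is not CouchDB's
actual code: find_start_seq is a hypothetical helper, and the field
names (session_id, source_last_seq, history, recorded_seq) are assumed
to match the replication checkpoint documents.

```python
def find_start_seq(source_log, target_log):
    """Return the sequence to resume from, or 0 for a full replication.

    Simplified illustration only. Each log is a dict shaped like a
    replication checkpoint document, or None if the document is missing.
    """
    if source_log is None or target_log is None:
        return 0  # no checkpoint on one side: start from scratch
    if source_log["session_id"] == target_log["session_id"]:
        return source_log["source_last_seq"]  # records agree
    # Records differ: look for the most recent session both sides recorded.
    target_sessions = {h["session_id"] for h in target_log.get("history", [])}
    for entry in source_log.get("history", []):  # assumed newest first
        if entry["session_id"] in target_sessions:
            return entry["recorded_seq"]
    return 0  # no common ancestry: full replication


# Made-up example data: the target is one checkpoint behind the source.
source = {
    "session_id": "s3",
    "source_last_seq": 300,
    "history": [
        {"session_id": "s3", "recorded_seq": 300},
        {"session_id": "s2", "recorded_seq": 200},
    ],
}
target = {
    "session_id": "s2",
    "source_last_seq": 200,
    "history": [{"session_id": "s2", "recorded_seq": 200}],
}
unrelated = {"session_id": "x1", "source_last_seq": 50, "history": []}

resume_seq = find_start_seq(source, target)       # shared session "s2"
restart_seq = find_start_seq(source, unrelated)   # nothing in common
```

Under this model, a checkpoint that is missing or unrelated on either
side is enough to trigger the "no common ancestry" path.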

The database stats are as follows:
  * Size: 251.5 GB
  * # of docs: 65,984,336
  * update seq: 66,185,521

The new machines are pretty beefy: 6xRAID 10 15K drives, 32GB of RAM
(31GB of which is being used for disk cache), 8 cores.

So, my question: why does Couch forget where it last checkpointed its
replication, and why does it take so long for it to figure out where
it should begin?