couchdb-user mailing list archives

From Filipe David Manana <fdman...@apache.org>
Subject Re: How to tell if replication is caught up?
Date Tue, 22 Mar 2011 20:02:49 GMT
Hi Wayne,

On Tue, Mar 22, 2011 at 7:27 PM, Wayne Conrad <wayne@databill.com> wrote:
> My largest, ~600GB database was awful to compact.  Because much of it seldom
> changes, I sharded that database by account, yielding about 500 databases of
> various sizes.  With a compaction daemon that only compacts a database when
> it grows, compaction is no longer a problem.  However, I appear to be
> suffering now when it comes to replication.
>
> Five hundred continuous "pull" replications have the destination database
> crying for mercy.  Its four CPUs are continuously busy (load average ~4) and
> requests to the destination database occasionally time out.
>
> The replication script starts a "pull" replication for each database, one at
> a time.  The replication requests start out taking about 0.3 seconds per
> database, but towards the end of the list each request is taking many
> seconds.
>
> Shortly after the replication starts, before it's got past more than a few
> dozen databases, there is a brief flood of stack traces (or whatever Erlang
> calls them) in the destination couch log.  I think there are fewer lines of
> error info than there are atoms in the sun, but only just.  Is there a guide
> that can help me know which lines of that log you need to know?
>
> The source database is not suffering: Its load average is < 1 and it serves
> requests quickly.
>
> Due to the number of databases, I've added "ulimit -n 32768" to the startup
> script.
>
> We're running version 1.2.0ac052866-git on linux 2.6.32.  This version has
> the new replicator.
>
> * Are we "doing it all wrong?"
>
> * Can I expect the storm to abate once all of the replications are caught
> up?
>
> * How can I tell which replications are "caught up?"  I see that a GET to
> /_active_tasks tells me that some replication tasks are "Starting" and
> others have, e.g., "Processed source seq 17", but I don't know if this is
> enough to know what's caught up and what's not.  Do I have to query the
> source database somehow to find out what source sequence is available?

You can consult the checkpoints in either the target or the source
database. The checkpoints are documents with IDs like
_local/replication_id. The replication_id is what you see in the log
file and _active_tasks. Those checkpoints have useful information in
them.
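
As a rough illustration of that check, here is a minimal Python sketch. It
rests on assumptions not stated above: that the checkpoint document stores the
last processed source sequence in a field named "source_last_seq", and that a
GET on the source database URL returns its current sequence as "update_seq".
The helper names and URLs are hypothetical. Checkpoints are only written
periodically, so a match means "caught up as of the last checkpoint".

```python
import json
import urllib.parse
import urllib.request


def fetch_json(url):
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def checkpoint_url(db_url, replication_id):
    # Checkpoints live at <db>/_local/<replication_id>; percent-encode the id.
    quoted = urllib.parse.quote(replication_id, safe="")
    return db_url.rstrip("/") + "/_local/" + quoted


def is_caught_up(checkpoint, source_info):
    # Assumed field names: "source_last_seq" in the checkpoint document and
    # "update_seq" in the source database info document.
    last = checkpoint.get("source_last_seq")
    return last is not None and last == source_info.get("update_seq")


def replication_caught_up(source_url, target_url, replication_id):
    """Compare the target's checkpoint with the source's current sequence."""
    checkpoint = fetch_json(checkpoint_url(target_url, replication_id))
    source_info = fetch_json(source_url)
    return is_caught_up(checkpoint, source_info)
```

Calling replication_caught_up("http://source:5984/db", "http://target:5984/db",
rep_id) for each database would then give a coarse caught-up/not-caught-up
answer, where rep_id is the ID as it appears in the checkpoint document name.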

Also, can you share the logs? I would like to see the errors and stack
traces you get - without them it's hard to tell what is going wrong.

However, 500 replications in parallel seems a bit too much.

cheers
>
> Best Regards,
> Wayne Conrad
>



-- 
Filipe David Manana,
fdmanana@gmail.com, fdmanana@apache.org

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
