couchdb-user mailing list archives

From Chris Stockton <chrisstockto...@gmail.com>
Subject Re: CouchDB Crash report db_not_found when attempting to replicate databases
Date Tue, 13 Sep 2011 19:19:19 GMT
Hello,

On Tue, Sep 13, 2011 at 11:44 AM, Max Ogden <max@maxogden.com> wrote:
> Hi Chris,
>
> From what I understand the current state of the replicator (as of 1.1) is
> that for certain types of collections of documents it can be somewhat
> fragile. In the case of the node.js package repository, http://npmjs.org,
> there are many relatively large (~100MB) documents that would sometimes
> throw errors or timeout during replication and crash the replicator, at
> which point the replicator would restart and attempt to pick up where it
> left off. I am not an expert in the internals of the replicator but
> apparently the cumulative time required for the replicator to repeatedly
> crash and then subsequently relocate itself in the _changes feed in the
> case of replicating the node package manager was making the built-in
> couch replicator unusable for the task.
>

First of all, thank you for your response; I appreciate your time. We
have had a rocky road with replication as well, everything from system
limits to single document/view/reduce errors causing processes to
spawn wildly and cripple machines. We have slowly worked through those
issues by raising system limits and Erlang VM limits.

I feel like the absolute root cause of our problem is that we scale
via many smaller databases instead of a single large one. We are at
about 4200 databases right now, and it's painful to run
netstat -nap|grep beam|wc -l and see 4200 active TCP connections. I
have brought up suggestions and comments in the past about server-wide
replication with a simple filtering function, so that a small pool of
TCP connections and processes could be used, greatly improving our
scaling pattern of many small databases. I would be able to allocate
time to try to contribute some kind of patch to do this, but I simply
do not know Erlang, and it is very far from the languages I know (C,
Java, PHP, anything close to these.. Erlang is an entirely different
world).
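
To make that concrete, here is a rough sketch of the kind of batched,
filtered replication driver I have in mind. It only uses the stock
/_all_dbs and /_replicate HTTP endpoints; the host names, the
"customer_" name-prefix filter and the pool size are made-up
placeholders for illustration, not something we run today:

import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SOURCE = "http://source-host:5984"   # placeholder hosts
TARGET = "http://target-host:5984"

def all_dbs(server):
    with urllib.request.urlopen(server + "/_all_dbs") as resp:
        return json.load(resp)

def replicate(db):
    # single-pass replication of one database, pushed from SOURCE to TARGET
    body = json.dumps({"source": SOURCE + "/" + db,
                       "target": TARGET + "/" + db}).encode()
    req = urllib.request.Request(SOURCE + "/_replicate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return db, json.load(resp).get("ok", False)

# stand-in for the "simple filtering function" mentioned above
dbs = [db for db in all_dbs(SOURCE) if db.startswith("customer_")]

# a small worker pool instead of thousands of long-lived connections
with ThreadPoolExecutor(max_workers=8) as pool:
    for db, ok in pool.map(replicate, dbs):
        print(db, "ok" if ok else "FAILED")

The point being exactly what I described: one process, a handful of
connections, and a filter deciding which databases take part.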

I have thought about changing our replication processes to do only
single-pass, non-continuous replication. Currently they manage and
reconcile dropped replication tasks by monitoring status and using the
continuous=true flag, but I may need to drop that at the cost of
possible data loss if we get a crash in between passes.
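
For reference, the API change would just be leaving the continuous
flag out of the body POSTed to /_replicate (same placeholder hosts as
in the sketch above):

import json
import urllib.request

def start_replication(server, source, target, continuous=False):
    # Omitting "continuous" gives a single-pass replication that
    # finishes once the target has caught up; with it, CouchDB keeps
    # the task (and its TCP connection) alive.
    doc = {"source": source, "target": target}
    if continuous:
        doc["continuous"] = True
    req = urllib.request.Request(server + "/_replicate",
                                 data=json.dumps(doc).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# one pass, then exit; re-run from cron instead of keeping the task alive
start_replication("http://source-host:5984",
                  "http://source-host:5984/some_db",
                  "http://target-host:5984/some_db")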

> Two solutions exist that I know of. There is a new replicator in trunk (not
> to be confused with the _replicator db from 1.1 -- it is still using the old
> replicator algorithms) and there is also a more reliable replicator written
> in node.js https://github.com/mikeal/replicate that was written
> specifically to replicate the node package repository between hosting
> providers.
>

Is there any documentation on this? Although I have heard good things,
I am not familiar with node.js; I am interested in any alternatives
that better fit our use cases. At the end of the day, stability, data
consistency and reliability for our customers are my biggest concerns.
Right now we don't have that, and it's what I'm aiming for; no more
2 AM NOC phone calls is the goal! :-)

> Additionally it may be useful if you could describe the 'fingerprint' of
> your documents a bit. How many documents are in the failing databases? are
> the documents large or small? do they have many attachments? how large is
> your _changes feed?
>

The failing databases do not share a common signature: some are very
small, maybe 10 total documents, and some may have more than 10
thousand. Some have had no changes for a very long time, some are
recent. The failures shared no common ground based on my observations.

Additional info:
  - We have around 4200 databases
  - The typical document is under 2 KB; they are basically "table"
rows, simple key/value pairs
  - The changes feed is pretty small on most databases experiencing issues
  - We compact databases which had changes each night (see the sketch
after this list)
  - A small percentage, around 10%, have attachments; they do not seem
to be related to our issues
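
For the compaction point above, the nightly job is conceptually just
this (a sketch, not our actual script; it assumes comparing each
database's update_seq from GET /db against the previous night's value
is enough to tell whether anything changed):

import json
import urllib.request

SERVER = "http://localhost:5984"   # placeholder

def get_json(path):
    with urllib.request.urlopen(SERVER + path) as resp:
        return json.load(resp)

def compact(db):
    # POST /{db}/_compact kicks off compaction in the background
    req = urllib.request.Request(SERVER + "/" + db + "/_compact", data=b"",
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

last_seen = {}   # hypothetical persisted state from the previous night

for db in get_json("/_all_dbs"):
    seq = get_json("/" + db)["update_seq"]
    if last_seen.get(db) != seq:      # only databases that actually changed
        compact(db)
    last_seen[db] = seq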

I am going to look into some of the alternative replicators you have
pointed me to; feel free to give any specific suggestions based on the
info above.

Thanks,

-Chris
