From: Max Ogden
Date: Tue, 13 Sep 2011 12:30:09 -0700
Subject: Re: CouchDB Crash report db_not_found when attempting to replicate databases
To: user@couchdb.apache.org

Hi Chris,

After installing https://github.com/joyent/node and http://npmjs.org/ you
can simply do

    npm install replicate

and then

    replicate http://sourcecouch/db http://destinationcouch/db

It will simply return a 'success' message when it completes. There isn't
any progress monitoring output yet. There also isn't support for continuous
replication. Alternatively, you could write custom node.js code for
finer-grained behavior.

Cheers,

Max

On Tue, Sep 13, 2011 at 12:19 PM, Chris Stockton wrote:

> Hello,
>
> On Tue, Sep 13, 2011 at 11:44 AM, Max Ogden wrote:
> > Hi Chris,
> >
> > From what I understand the current state of the replicator (as of 1.1) is
> > that for certain types of collections of documents it can be somewhat
> > fragile. In the case of the node.js package repository, http://npmjs.org,
> > there are many relatively large (~100MB) documents that would sometimes
> > throw errors or time out during replication and crash the replicator, at
> > which point the replicator would restart and attempt to pick up where it
> > left off. I am not an expert in the internals of the replicator but
> > apparently the cumulative time required for the replicator to repeatedly
> > crash and then subsequently relocate itself in the _changes feed in the
> > case of replicating the node package manager was making the built-in
> > couch replicator unusable for the task.
> >
>
> First of all I thank you for your response, I appreciate your time. We
> have had a rocky road with replication as well, everything from system
> limits to single document/view/reduce errors causing processes to
> spawn wildly, crippling machines. We have slowly worked through them by
> upping system limits and erlang VM limits.
>
> I feel like the absolute root cause of our problem is that we scale
> via many smaller databases instead of a single large one. We are at
> about 4200 databases right now and it's painful to netstat -nap|grep
> beam|wc -l and see 4200 active tcp connections. I have brought up
> suggestions and comments in the past about server wide replication,
> with some simple filtering function so a small pool of tcp connections
> and processes could be used, greatly improving our scaling pattern of
> many, small databases. I would be able to allocate time to try to
> contribute some kinda patch to do this, but I simply do not know
> erlang and it is very far from the languages I know (c, java, php,
> anything close to these.. erlang is an entirely different world)
>
> I have thought about changing our replication processes to only do
> single pass non-continuous replication. Currently they manage and
> reconcile dropped replication tasks by monitoring status, using the
> continuous=true flag, but I may need to drop that at the cost of
> possible data loss if we get a crash in between passes.
>
> > Two solutions exist that I know of. There is a new replicator in trunk (not
> > to be confused with the _replicator db from 1.1 -- it is still using the old
> > replicator algorithms) and there is also a more reliable replicator written
> > in node.js https://github.com/mikeal/replicate that was written
> > specifically to replicate the node package repository between hosting
> > providers.
> >
>
> Is there any documentation on this? Although I have heard good things,
> I am not familiar with node.js. I am interested in any alternatives
> that better fit our use cases. At the end of the day stability, data
> consistency and reliability for our customers is the biggest concern
> for me. Right now we don't have that and it's what I'm aiming for, no
> more 2AM NOC phone calls is the goal! :-)
>
> > Additionally it may be useful if you could describe the 'fingerprint' of
> > your documents a bit. How many documents are in the failing databases? are
> > the documents large or small? do they have many attachments? how large is
> > your _changes feed?
> >
>
> The failing databases do not share a common signature, some are very
> small, maybe 10 total documents, some may have more than 10 thousand.
> Some have had no changes for a very long time, some are recent. The
> failures shared no common ground based on my observations.
>
> Additional info:
>  - We have around 4200 databases
>  - The typical document is under 2 KB, they are basically "table"
>    rows, simple key/value pairs
>  - The changes feed is pretty small on most databases experiencing issues
>  - We compact databases which had changes each night
>  - A small percentage, around 10%, have attachments; they seem to not be
>    related to our issues
>
> I am going to look into some of the alternative replicators you have
> given me, feel free to give any specific suggestions based on the
> above info.
>
> Thanks,
>
> -Chris
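
For the single-pass, non-continuous option Chris mentions, with a small pool of connections instead of one per database, a rough Python sketch might look like the following. It assumes CouchDB's POST /_replicate endpoint with a JSON source/target body; the helper names (replication_body, replicate_one, replicate_all), the default of 8 workers and the 600 second timeout are hypothetical choices for illustration, not anything taken from the thread.

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def replication_body(source_base, target_base, db):
    """Build the JSON body for a one-shot (non-continuous) POST /_replicate."""
    # Database names may contain "/", which must be percent-encoded in the URL.
    name = urllib.parse.quote(db, safe="")
    return {
        "source": f"{source_base.rstrip('/')}/{name}",
        "target": f"{target_base.rstrip('/')}/{name}",
    }


def replicate_one(couch_base, body, timeout=600):
    """POST the body to /_replicate on couch_base and return the parsed reply."""
    req = urllib.request.Request(
        f"{couch_base.rstrip('/')}/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises HTTPError on 4xx/5xx replies (for example a missing db).
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def replicate_all(dbs, source_base, target_base, post=replicate_one, workers=8):
    """Run one pass over dbs with at most `workers` replications in flight.

    Returns a dict mapping each db name to "ok" or to an error string, so the
    failed databases can be retried on the next pass.
    """
    def run(db):
        try:
            body = replication_body(source_base, target_base, db)
            reply = post(target_base, body)
            return db, "ok" if reply.get("ok") else json.dumps(reply)
        except Exception as exc:  # one bad database should not stop the pass
            return db, f"error: {exc}"

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run, dbs))
```

The list of names could come from GET /_all_dbs on the source, and the returned dict gives a per-database status to reconcile between passes. The worker count caps how many replications, and so roughly how many connections, are open at once. The trade-off Chris describes still applies: changes made between passes are not copied until the next pass runs.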