From: Mark Hahn
Date: Tue, 11 Oct 2011 05:20:55 -0700
Subject: Re: CouchDB Replication lacking resilience for many database
To: user@couchdb.apache.org
Cc: dev@couchdb.apache.org

It would be nice to have a control panel that displays things like this
message queue's depth, connection counts, memory consumed, CPU consumed,
reads/writes per second, view rebuilds per second, average response times,
etc. I'm sure someone could come up with many more pertinent variables. For
extra credit the values could be plotted against time. When someone has a
problem they could post the log here.
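For illustration, a few of those numbers can already be sampled from a remsh
session attached to the node running CouchDB, using plain Erlang BIFs. This
is only a rough sketch; nothing in it is CouchDB-specific apart from the
registered couch_server process name:

    %% Sketch only: sample a few VM-level numbers from a remsh shell
    %% attached to the CouchDB node.
    {message_queue_len, QLen} =
        process_info(whereis(couch_server), message_queue_len),
    io:format("couch_server queue: ~p, processes: ~p, memory: ~p bytes, run queue: ~p~n",
              [QLen,
               erlang:system_info(process_count),
               erlang:memory(total),
               erlang:statistics(run_queue)]).

Connection counts, reads/writes per second, and response times would have to
come from CouchDB's own statistics (e.g. the /_stats resource) rather than
from the VM.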
On Mon, Oct 10, 2011 at 10:15 PM, Paul Davis wrote:

> On Mon, Oct 10, 2011 at 11:03 PM, Chris Stockton wrote:
> > Hello,
> >
> > On Mon, Oct 10, 2011 at 5:18 PM, Adam Kocoloski wrote:
> >> On Oct 10, 2011, at 8:02 PM, Chris Stockton wrote:
> >>
> >>> Hello,
> >>>
> >>> On Mon, Oct 10, 2011 at 4:19 PM, Filipe David Manana wrote:
> >>>> On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton wrote:
> >>>> Chris,
> >>>>
> >>>> That work is in the '1.2.x' branch (and master).
> >>>> CouchDB recently migrated from SVN to Git, see:
> >>>> http://couchdb.apache.org/community/code.html
> >>>>
> >>> Thank you very much for the response, Filipe. Do you have any
> >>> documentation or a more detailed summary of what these changes
> >>> include and their possible benefits? I would love to hear about any
> >>> tuning or replication tips you may have for our growth issues;
> >>> perhaps you could answer a basic question if nothing else: do the
> >>> changes in this branch minimize the performance impact of continuous
> >>> replication on many databases?
> >>>
> >>> Regardless, I plan on getting a build of that branch and doing some
> >>> testing of my own very soon.
> >>>
> >>> Thank you!
> >>>
> >>> -Chris
> >>
> >> I'm pretty sure that even in 1.2.x and master each replication with a
> >> remote source still requires one dedicated TCP connection to consume
> >> the _changes feed. Replications with a local source have always been
> >> able to use a connection pool per host:port combination. That's not
> >> to downplay the significance of the rewrite of the replicator in
> >> 1.2.x; Filipe put quite a lot of time into it.
> >>
> >> The link to "those darn errors" just pointed to the mbox browser for
> >> September 2011. Do you have a more specific link?
> >>
> >> Regards,
> >> Adam
> >
> > Well, I will remain optimistic that the rewrite has solved several of
> > my issues regardless. I guess the idle TCP connections by themselves
> > are not too bad; it's when they all start working simultaneously that
> > it becomes an issue =)
> >
> > Sorry Adam, here is a better link:
> > http://mail-archives.apache.org/mod_mbox/couchdb-user/201109.mbox/%3CCALKFbxuugLJJY-NH46U0u584L+XDqM3NGSpeNxsJyrxosPEuCg@mail.gmail.com%3E
> > The actual text was:
> >
> > ---------------
> >
> > It seems that I am randomly getting errors about crashes as our
> > replicator runs; all this replicator does is make sure that all
> > databases on the master server replicate to our failover by checking
> > their status.
> >
> > Details:
> > - I notice the below error in the logs, anywhere from 0 to 30 at a time.
> > - It seems that a database might start replicating okay and then stop.
> > - These errors [1] are on the failover pulling from master.
> > - No errors are displayed on the master server.
> > - The databases inside the URL in the db_not_found portion of the
> >   error are always available via curl from the failover machine, which
> >   makes the error strange; somehow it thinks it can't find the database.
> > - Master seems healthy at all times, all databases are available, and
> >   there are no errors in its log.
> >
> > [1] --
> > [Mon, 12 Sep 2011 18:34:14 GMT] [error] [<0.22466.5305>]
> > {error_report,<0.30.0>,
> >     {<0.22466.5305>,crash_report,
> >         [[{initial_call,{couch_rep,init,['Argument__1']}},
> >           {pid,<0.22466.5305>},
> >           {registered_name,[]},
> >           {error_info,
> >               {exit,
> >                   {db_not_found,
> >                       <<"http://user:pass@server:5984/db_10944/">>},
> >                   [{gen_server,init_it,6},
> >                    {proc_lib,init_p_do_apply,3}]}},
> >           {ancestors,
> >               [couch_rep_sup,couch_primary_services,
> >                couch_server_sup,<0.31.0>]},
> >           {messages,[]},
> >           {links,[<0.81.0>]},
> >           {dictionary,[]},
> >           {trap_exit,true},
> >           {status,running},
> >           {heap_size,2584},
> >           {stack_size,24},
> >           {reductions,794}],
> >          []]}}
> >
>
> One place I've seen this error pop up when it looks like it shouldn't is
> if couch_server gets backed up. If you remsh into one of those nodes you
> could try the following:
>
>     process_info(whereis(couch_server), message_queue_len).
>
> And if that number keeps growing, that could be the issue.
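To see whether that number really does keep growing, one quick way is to
sample it repeatedly from the same remsh shell. A minimal sketch, using only
standard Erlang calls (nothing CouchDB-specific beyond the registered
couch_server name):

    %% Sample the couch_server message queue length once per second for
    %% ten seconds; a steadily climbing value suggests couch_server is
    %% backed up, as described above.
    lists:foreach(
        fun(_) ->
            {message_queue_len, Len} =
                process_info(whereis(couch_server), message_queue_len),
            io:format("couch_server message_queue_len: ~p~n", [Len]),
            timer:sleep(1000)
        end,
        lists:seq(1, 10)).

If the length climbs across the samples while replications are active, the
couch_server backlog Paul describes is a likely culprit.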