Subject: Re: CouchDB load spike (even with low traffic)?
From: Adam Kocoloski <kocolosk@apache.org>
Date: Wed, 30 Apr 2014 06:40:18 -0400
To: user@couchdb.apache.org

Thanks Mike! I filed https://issues.apache.org/jira/browse/COUCHDB-2231
and linked your gist in there as a possible solution.

Adam

On Apr 30, 2014, at 1:12 AM, Mike Marino <mmarino@gmail.com> wrote:

> Hi Marty,
>
> It's difficult for me to tell why couchdb is not stopping with your init
> script, but we had a similar issue that I fixed by patching the couchdb
> startup script ("executable"). The issue was that the 'shepherd'
> program was respawning couch after a requested shutdown.
>
> This was discussed on the list a while ago and I sent our fix out, but
> I don't think it was ever integrated. Anyway, here's the gist (for 1.3,
> though I think the file has remained the same in the newer versions):
>
> https://gist.github.com/7601778
>
> Cheers,
> Mike
>
> On 30.04.2014 at 06:52, Marty Hu <marty.hu@gmail.com> wrote:
>
> Okay, after doing a bit more work this is what I found out:
>
> 1. When I start couchdb on a fresh server, it appears to run correctly.
>
> 2. However, the conventional "sudo service couchdb stop" does not
> actually stop couchdb. I know this because afterwards I can still kill
> the couchdb processes with ps -U couchdb -o pid= | xargs kill -9
>
> 3. We use chef for configuration, so at a set interval it will queue up
> a "sudo service couchdb restart", which will try to stop the process
> (the process won't stop) and then start a new process (this process will
> actually try to start). However, the second process will not be able to
> bind to the port (the first process never got killed and still holds
> it), so it will throw the error.
>
> I imagine that this is a configuration issue (and so not really a fault
> of you guys), but I'd welcome any tips about how to deal with this short
> of changing the init script to be a messy killer.
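
The port conflict described in point 3 can be checked directly before a
restart. Below is a minimal sketch in Python, assuming CouchDB listens on
its default port 5984 on 127.0.0.1 (adjust to whatever your local.ini
says). The helper name port_in_use is made up for illustration and is not
part of CouchDB or its init script.

```python
import errno
import socket


def port_in_use(host: str, port: int) -> bool:
    """Return True if binding to (host, port) fails with EADDRINUSE.

    This is the same condition that shows up as eaddrinuse in the
    CouchDB log when a second instance tries to start.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # some other problem, e.g. a bad address
        return False


# Example use (assumes the default CouchDB port):
#   if port_in_use("127.0.0.1", 5984):
#       print("something still holds 5984; the old couchdb may be alive")
```

A True result only says that something holds the port. It does not say
which process, so it is best combined with the ps check from point 2.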
>
> On Tue, Apr 29, 2014 at 6:54 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
>
> Hi Marty, the mailing list stripped out the attachments except for
> spike.txt.
>
> I don't know if they're the cause of the load spikes that you see, but
> the eaddrinuse errors are not normal. They can be caused by another
> process listening on the same port as CouchDB. Fairly peculiar stuff.
>
> The timeout trying to open the splits-v0.1.7 at 21:23 does line up with
> your report that the system was heavily loaded at the time, but there's
> really not too much to go on here.
>
> Regards, Adam
>
> On Apr 29, 2014, at 7:46 PM, Marty Hu <marty.hu@gmail.com> wrote:
>
> Thanks for the follow-up.
>
> I've attached nagios graphs (load, disk, and ping) of one such event,
> which occurred at 2:24pm (after the drop in disk) according to my nagios
> emails. I've also attached database logs (with some client-specific
> queries removed). The error was fixed around 2:30pm. Notably, the log
> files are in GMT.
>
> Unfortunately I don't have any graphs for the event other than what's on
> nagios.
>
> Are the connection errors with CouchDB normal? We get them continuously
> (around every minute) even during normal operation with the DB not
> crashing.
>
> On Tue, Apr 29, 2014 at 2:34 AM, Alexander Shorin <kxepal@gmail.com> wrote:
>
> Hi Marty,
>
> thanks for following up! I see your problem, but here is what we would
> need:
>
> 1. CouchDB stats graphs and your system disk, network and memory ones.
> If you cannot share them in public, feel free to send them to me in
> private. We need to know whether they are related. For instance, high
> memory usage may be caused by uploading a large number of big files:
> you'll easily notice that by comparing the CouchDB, network and memory
> graphs for the spike period.
>
> 2. CouchDB log entries for the spike event. Graphs can only show you
> that something is going wrong, and we could only guess (we mostly guess
> right, but without much precision) what exactly is going wrong. Logs
> will help us find the actual requests that cause the memory spike.
>
> After that we can start to think about the problem. For instance, if
> the spikes happen due to large attachment uploads, there is not much
> to do. On the other hand, the query server may easily eat quite a big
> chunk of memory. We'll easily notice that by monitoring the
> /_active_tasks resource (if the problem is in views) or by looking
> through the logs for the spike period. And this case can be fixed.
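
Monitoring /_active_tasks as suggested above could look roughly like the
following. This is only a sketch, assuming CouchDB is reachable at
http://127.0.0.1:5984 and that the request is allowed without credentials
(on a server with admins configured, /_active_tasks needs admin
authentication, which this sketch does not handle). The function names
are made up for illustration.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_active_tasks(base_url: str = "http://127.0.0.1:5984") -> list:
    """GET /_active_tasks and return the decoded JSON array of tasks."""
    with urlopen(base_url + "/_active_tasks", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def count_by_type(tasks: list) -> Counter:
    """Count running tasks by their "type" field.

    Typical types are "indexer" (view builds), "replication" and
    compaction tasks. Entries without a type are counted as "unknown".
    """
    return Counter(task.get("type", "unknown") for task in tasks)


# Example use, e.g. from a cron job or a monitoring plugin:
#   counts = count_by_type(fetch_active_tasks())
#   print(dict(counts))
```

Logging these counts once a minute and lining them up with the load
graph would show whether view indexing or replication is running when a
spike starts.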
>
> Not sure which tools you're using for monitoring and drawing graphs,
> but take a look at the following projects:
> - https://github.com/gws/munin-plugin-couchdb - Munin plugin for
> CouchDB monitoring. Unfortunately, it doesn't handle system metrics for
> the CouchDB process - I'll only add this during this week, but make sure
> you have a similar plugin for your monitoring system.
> - https://github.com/etsy/skyline - anomaly detector. spikes are so
> - https://github.com/etsy/oculus - metrics correlation tool. it would
> be very easy to compare multiple graphs for the anomaly period with
> it.
>
> --
> ,,,^..^,,,
>
> On Tue, Apr 29, 2014 at 8:15 AM, Marty Hu <marty.hu@gmail.com> wrote:
>
> We've been running CouchDB v1.5.0 on AWS and it's been working fine.
> Recently AWS came out with new prices for their new m3 instances so we
> switched our CouchDB instance to use an m3.large. We have a relatively
> small database with < 10GB of data in it.
>
> Our steady state metrics for it are system loads of 0.2 and memory
> usages of 5% or so. However, we noticed that every few hours (3-4 times
> per day) we get a huge spike that pushes our load to 1.5 or so and
> memory usage to close to 100%.
>
> We don't run any cronjobs that involve the database and our traffic
> flow is about the same over the day. We do run a continuous replication
> from one database on the west coast to another on the east coast.
>
> This has been stumping me for a bit - any ideas?
>
> <spike.txt>