Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 7.2 \(1874\))
Subject: Re: CouchDB load spike (even with low traffic)?
From: Adam Kocoloski <kocolosk@apache.org>
In-Reply-To: 
 <CAPG1hyOJzpqyx1uQZ8Epu+Xm7T4L=w8PAj5X=fBV9zbL8KGe=Q@mail.gmail.com>
Date: Tue, 29 Apr 2014 21:54:17 -0400
Content-Transfer-Encoding: quoted-printable
Message-Id: <13A2A81D-5942-4A2A-8455-1BDACDA3A221@apache.org>
References: 
 <CAPG1hyN6uX5=B904OGXyTmwXfs8UQJhgAN8Tqhs62WzZ2dnpBw@mail.gmail.com>
 <CAHdjipJA+T6f8Cwe0NCCcPSi3cVCTmqfwJ9CtpjbRPJnwASE0w@mail.gmail.com>
 <CAPG1hyOJzpqyx1uQZ8Epu+Xm7T4L=w8PAj5X=fBV9zbL8KGe=Q@mail.gmail.com>
To: user@couchdb.apache.org

Hi Marty, the mailing list stripped out the attachments except for
spike.txt.

I don't know if they're the cause of the load spikes that you see, but
the eaddrinuse errors are not normal. They can be caused by another
process listening on the same port as CouchDB. Fairly peculiar stuff.
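
One way to check for a port conflict is to try binding the port yourself
while CouchDB is stopped. This is only a minimal sketch: port_is_free is
a made-up helper name, and 5984 is CouchDB's default port, which I am
assuming your install still uses.

```python
import errno
import socket


def port_is_free(host: str, port: int) -> bool:
    """Return True if host:port can be bound, False if it is already in use."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            # EADDRINUSE is the OS-level error behind Erlang's eaddrinuse.
            if exc.errno == errno.EADDRINUSE:
                return False
            raise
        return True


if __name__ == "__main__":
    # Assumption: CouchDB listens on its default port, 5984.
    print(port_is_free("127.0.0.1", 5984))
```

If this prints False while CouchDB is shut down, something else is
holding the port.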

The timeout trying to open the splits-v0.1.7 at 21:23 does line up with
your report that the system was heavily loaded at the time, but there's
really not too much to go on here.
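
For reference, here is a rough sketch of the time conversion behind that
comparison. It assumes the 2:24pm alert time is US Pacific Daylight Time
(UTC-7) and that the event happened on April 29; neither is stated in
the thread, so adjust both as needed.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the Nagios alert time is in US Pacific Daylight Time (UTC-7).
PDT = timezone(timedelta(hours=-7), "PDT")


def to_gmt(local: datetime) -> datetime:
    """Convert a timezone-aware local time to GMT/UTC, as used in the logs."""
    return local.astimezone(timezone.utc)


# Assumption: the date of the event; only the time of day was reported.
alert = datetime(2014, 4, 29, 14, 24, tzinfo=PDT)
print(to_gmt(alert).strftime("%H:%M"))  # 21:24
```

Under those assumptions the alert lands about a minute after the 21:23
timeout in the log.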

Regards, Adam

On Apr 29, 2014, at 7:46 PM, Marty Hu <marty.hu@gmail.com> wrote:

> Thanks for the follow-up.
>
> I've attached nagios graphs (load, disk, and ping) of one such event,
> which occurred at 2:24pm (after the drop in disk) according to my nagios
> emails. I've also attached database logs (with some client-specific
> queries removed). The error was fixed around 2:30pm. Notably, the log
> files are in GMT.
>
> Unfortunately I don't have any graphs for the event other than what's
> on nagios.
>
> Are the connection errors with CouchDB normal? We get them
> continuously (around every minute) even during normal operation with the
> DB not crashing.
>
>
> On Tue, Apr 29, 2014 at 2:34 AM, Alexander Shorin <kxepal@gmail.com> wrote:
> Hi Marty,
>
> Thanks for following up! I see your problem, but here is what we would
> need:
>
> 1. CouchDB stats graphs, along with your system disk, network, and
> memory graphs. If you cannot share them in public, feel free to send
> them to me in private. We need to know whether they are related. For
> instance, high memory usage may be caused by uploading a large number
> of big files: you'll easily notice that by comparing the CouchDB,
> network, and memory graphs for the spike period.
>
> 2. CouchDB log entries for the spike event. Graphs can only show that
> something is going wrong, and we could only guess (we usually guess
> right, but without much precision) what exactly is going wrong. Logs
> will help us find the actual requests that cause the memory spike.
>
> After that we can start to think about the problem. For instance, if
> the spikes happen due to large attachment uploads, there is not much
> to do. On the other hand, the query server may easily eat quite a big
> chunk of memory. We'll easily notice that by monitoring the
> /_active_tasks resource (if the problem is in views) or by looking
> through the logs for the spike period. And this case can be fixed.
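
A minimal sketch of that /_active_tasks check, in case it is useful. The
base URL (default port 5984 on localhost) and the helper names are
assumptions, and the server may require admin credentials for this
resource, which the sketch does not handle.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_active_tasks(base_url: str = "http://127.0.0.1:5984") -> list:
    """Fetch the JSON array of running tasks from /_active_tasks."""
    with urlopen(base_url + "/_active_tasks", timeout=5) as resp:
        return json.load(resp)


def count_by_type(tasks: list) -> Counter:
    """Count tasks per "type" field (e.g. indexer, replication)."""
    return Counter(task.get("type", "unknown") for task in tasks)


if __name__ == "__main__":
    # Run this during a spike; indexer tasks would point at view builds.
    print(count_by_type(fetch_active_tasks()))
```

Sampling this every few seconds during a spike and keeping the output
next to the load graph should show whether view indexing is involved.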
>
> Not sure which tools you're using for monitoring and drawing graphs,
> but take a look at the following projects:
> - https://github.com/gws/munin-plugin-couchdb - Munin plugin for
> CouchDB monitoring. Unfortunately, it doesn't handle system metrics
> for the CouchDB process. I'll only get to adding that sometime this
> week, but make sure you have a similar plugin for your monitoring
> system.
> - https://github.com/etsy/skyline - anomalies detector. spikes are so
> - https://github.com/etsy/oculus - metrics correlation tool. It would
> make it very easy to compare multiple graphs for the anomaly period.
>
> --
> ,,,^..^,,,
>
>
> On Tue, Apr 29, 2014 at 8:15 AM, Marty Hu <marty.hu@gmail.com> wrote:
> > We've been running CouchDB v1.5.0 on AWS and it's been working fine.
> > Recently AWS came out with new prices for their new m3 instances so we
> > switched our CouchDB instance to use an m3.large. We have a relatively
> > small database with < 10GB of data in it.
> >
> > Our steady state metrics for it are system loads of 0.2 and memory usage
> > of 5% or so. However, we noticed that every few hours (3-4 times per day)
> > we get a huge spike that pushes our load to 1.5 or so and memory usage to
> > close to 100%.
> >
> > We don't run any cronjobs that involve the database and our traffic flow
> > is about the same over the day. We do run a continuous replication from
> > one database on the west coast to another on the east coast.
> >
> > This has been stumping me for a bit - any ideas?
>
>
> <spike.txt>