From: Adam Kocoloski
To: user@couchdb.apache.org
Subject: Re: CouchDB load spike (even with low traffic)?
Date: Tue, 29 Apr 2014 21:54:17 -0400
Message-Id: <13A2A81D-5942-4A2A-8455-1BDACDA3A221@apache.org>

Hi Marty, the mailing list stripped out the attachments except for
spike.txt.

I don't know if they're the cause of the load spikes that you see, but
the eaddrinuse errors are not normal. They can be caused by another
process listening on the same port as CouchDB. Fairly peculiar stuff.

The timeout trying to open the splits-v0.1.7 at 21:23 does line up with
your report that the system was heavily loaded at the time, but there's
really not too much to go on here.

Regards, Adam

On Apr 29, 2014, at 7:46 PM, Marty Hu wrote:

> Thanks for the follow-up.
>
> I've attached nagios graphs (load, disk, and ping) of one such event,
> which occurred at 2:24pm (after the drop in disk) according to my
> nagios emails. I've also attached database logs (with some
> client-specific queries removed). The error was fixed around 2:30pm.
> Notably, the log files are in GMT.
>
> Unfortunately I don't have any graphs for the event other than what's
> on nagios.
>
> Are the connection errors with CouchDB normal? We get them
> continuously (around every minute) even during normal operation with
> the DB not crashing.
>
>
> On Tue, Apr 29, 2014 at 2:34 AM, Alexander Shorin wrote:
> Hi Marty,
>
> thanks for following up! I see your problem, but here is what we
> would need:
>
> 1. CouchDB stats graphs and your system disk, network and memory ones.
> If you cannot share them in public, feel free to send them to me in
> private. We need to know whether they are related. For instance, high
> memory usage may be caused by uploading a large number of big files:
> you'll easily notice that by comparing CouchDB, network and memory
> graphs for the spike period.
>
> 2. CouchDB log entries for the spike event. Graphs can only show you
> that something is going wrong, and we could only guess (we mostly
> guess right, but without much precision) what exactly is going wrong.
> Logs will help us find the actual requests that cause the memory
> spike.
>
> After that we can start to think about the problem. For instance, if
> the spikes happen due to large attachment uploads, there is not much
> to do. On the other hand, the query server may easily eat quite a big
> chunk of memory. We'll easily notice that by monitoring the
> /_active_tasks resource (if the problem is in views) or by looking
> through the logs for the spike period. And this case can be fixed.
>
> Not sure which tools you're using for monitoring and drawing graphs,
> but take a look at the following projects:
> - https://github.com/gws/munin-plugin-couchdb - Munin plugin for
> CouchDB monitoring. Unfortunately, it doesn't handle system metrics
> for the CouchDB process - I'll only add this during this week, but
> make sure you have a similar plugin for your monitoring system.
> - https://github.com/etsy/skyline - anomalies detector. spikes are so
> - https://github.com/etsy/oculus - metrics correlation tool. It would
> be very easy to compare multiple graphs for the anomaly period with
> it.
>
> --
> ,,,^..^,,,
>
>
> On Tue, Apr 29, 2014 at 8:15 AM, Marty Hu wrote:
> > We've been running CouchDB v1.5.0 on AWS and it's been working fine.
> > Recently AWS came out with new prices for their new m3 instances so
> > we switched our CouchDB instance to use an m3.large. We have a
> > relatively small database with < 10GB of data in it.
> >
> > Our steady state metrics for it are system loads of 0.2 and memory
> > usages of 5% or so. However, we noticed that every few hours (3-4
> > times per day) we get a huge spike that drives our load up to 1.5
> > or so and memory usage to close to 100%.
> >
> > We don't run any cronjobs that involve the database and our traffic
> > flows about the same over the day. We do run a continuous
> > replication from one database on the west coast to another on the
> > east coast.
> >
> > This has been stumping me for a bit - any ideas?
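
Two of the checks mentioned in this thread can be sketched in Python. This
is only an illustration, not something anyone in the thread posted. The
helper names (port_in_use, count_tasks_by_type, fetch_active_tasks) are
made up for this example, and the base URL assumes CouchDB on its default
port 5984 on localhost with /_active_tasks readable without credentials,
which may not match your setup.

```python
import errno
import json
import socket
from collections import Counter
from urllib.request import urlopen


def port_in_use(host: str, port: int) -> bool:
    """Return True if binding to (host, port) fails with EADDRINUSE.

    Hypothetical helper: run it while CouchDB is stopped to see whether
    some other process already holds the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # any other failure (e.g. permissions) is not our case
    return False


def count_tasks_by_type(tasks: list) -> Counter:
    """Count active tasks per "type" value (e.g. indexer, replication)."""
    return Counter(task.get("type", "unknown") for task in tasks)


def fetch_active_tasks(base_url: str = "http://localhost:5984") -> list:
    """Fetch the JSON list served at /_active_tasks.

    Assumption: no authentication is needed; adjust for your server.
    """
    with urlopen(base_url + "/_active_tasks", timeout=5) as resp:
        return json.load(resp)


# Example usage (needs a reachable CouchDB, so it is left commented out):
# print(port_in_use("127.0.0.1", 5984))
# print(count_tasks_by_type(fetch_active_tasks()))
```

Note that port_in_use also returns True when CouchDB itself is running on
that port, so it only says something about a conflicting process if
CouchDB is stopped first. Logging the per-type counts once a minute next
to the load graphs would show whether view indexing or replication lines
up with a spike.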