Subject: Re: CouchDB load spike (even with low traffic)?
From: Adam Kocoloski
Date: Wed, 30 Apr 2014 06:40:18 -0400
To: user@couchdb.apache.org
Message-Id: <0EED3676-5AA4-497C-8BDB-8CB6B6FA4C0F@apache.org>
In-Reply-To: <5527680828350014004@unknownmsgid>
References: <13A2A81D-5942-4A2A-8455-1BDACDA3A221@apache.org> <5527680828350014004@unknownmsgid>

Thanks Mike! I filed https://issues.apache.org/jira/browse/COUCHDB-2231
and linked your gist in there as a possible solution.

Adam

On Apr 30, 2014, at 1:12 AM, Mike Marino wrote:

> Hi Marty,
>
> It's difficult for me to tell why couchdb is not stopping when you use
> your init script, but we had a similar issue that I fixed by patching the
> couchdb startup script ("executable"). The issue was that the 'shepherd'
> program was respawning couch after a requested shutdown.
>
> This was discussed a while ago on the list and I sent our fix
> out, but I don't think it was ever integrated. Anyway, here's the gist
> (for 1.3, though I think the file has remained the same in the newer
> versions):
>
> https://gist.github.com/7601778
>
> Cheers,
> Mike
>
> On 30.04.2014 at 06:52, Marty Hu wrote:
>
> Okay, after doing a bit more work this is what I found out:
>
> 1. When I start couchdb on a fresh server, it appears to run correctly.
>
> 2. However, the conventional "sudo service couchdb stop" does not actually
> stop couchdb correctly. I know this because afterward I can still kill the
> couchdb processes with ps -U couchdb -o pid= | xargs kill -9
>
> 3. We use chef for configuration, so at a set interval it will queue up a
> "sudo service couchdb restart", which will try to stop the process (the
> process won't stop) and then start a new process (this process will
> actually try to start). However, the second process will not be able to
> bind to the port (the first process never got killed and still holds it),
> so it will throw the error.
>
> I imagine that this is a configuration issue (and so not really a fault of
> you guys), but I'd welcome any tips about how to deal with this short of
> changing the init script to be a messy killer.
>
> On Tue, Apr 29, 2014 at 6:54 PM, Adam Kocoloski wrote:
>
> Hi Marty, the mailing list stripped out the attachments except for
> spike.txt.
>
> I don't know if they're the cause of the load spikes that you see, but the
> eaddrinuse errors are not normal.
> They can be caused by another process listening on the same port as
> CouchDB. Fairly peculiar stuff.
>
> The timeout trying to open the splits-v0.1.7 at 21:23 does line up with
> your report that the system was heavily loaded at the time, but there's
> really not too much to go on here.
>
> Regards, Adam
>
> On Apr 29, 2014, at 7:46 PM, Marty Hu wrote:
>
> Thanks for the follow-up.
>
> I've attached nagios graphs (load, disk, and ping) of one such event,
> which occurred at 2:24pm (after the drop in disk) according to my nagios
> emails. I've also attached database logs (with some client-specific queries
> removed). The error was fixed around 2:30pm. Note that the log files are in
> GMT.
>
> Unfortunately I don't have any graphs for the event other than what's on
> nagios.
>
> Are the connection errors with CouchDB normal? We get them continuously
> (around every minute) even during normal operation with the DB not crashing.
>
> On Tue, Apr 29, 2014 at 2:34 AM, Alexander Shorin wrote:
>
> Hi Marty,
>
> thanks for following up! I see your problem, but here is what we would need:
>
> 1. CouchDB stats graphs and your system disk, network and memory ones.
> If you cannot share them in public, feel free to send them to me in private.
> We need to know whether they are related. For instance, high memory usage
> may be caused by uploading a large number of big files: you'll easily notice
> that by comparing the CouchDB, network and memory graphs for the spike
> period.
>
> 2. CouchDB log entries for the spike event. Graphs can only show that
> something is going wrong, and we could only guess (we'd usually guess
> right, but without much precision) what exactly is going wrong. Logs will
> help us find the actual requests that cause the memory spike.
>
> After that we can start to think about the problem. For instance, if the
> spikes happen due to large attachment uploads, there is not much
> to do. On the other hand, the query server may easily eat quite a big chunk
> of memory. We'll easily notice that by monitoring the /_active_tasks
> resource (if the problem is in views) or by looking through the logs for
> the spike period. And this case can be fixed.
>
> Not sure which tools you're using for monitoring and drawing graphs,
> but take a look at the following projects:
>
> - https://github.com/gws/munin-plugin-couchdb - Munin plugin for
> CouchDB monitoring. Unfortunately, it doesn't handle system metrics for
> the CouchDB process - I'll only add that sometime this week, but make sure
> you have a similar plugin for your monitoring system.
>
> - https://github.com/etsy/skyline - anomalies detector. spikes are so
>
> - https://github.com/etsy/oculus - metrics correlation tool. It would
> be very easy to compare multiple graphs for the anomaly period with
> it.
>
> --
> ,,,^..^,,,
>
> On Tue, Apr 29, 2014 at 8:15 AM, Marty Hu wrote:
>
> We've been running CouchDB v1.5.0 on AWS and it's been working fine.
> Recently AWS came out with new prices for their new m3 instances so we
> switched our CouchDB instance to use an m3.large. We have a relatively
> small database with < 10GB of data in it.
>
> Our steady state metrics for it are system loads of 0.2 and memory usage
> of 5% or so. However, we noticed that every few hours (3-4 times per day)
> we get a huge spike that pushes our load to 1.5 or so and memory usage to
> close to 100%.
>
> We don't run any cronjobs that involve the database and our traffic flow
> is about the same throughout the day. We do run a continuous replication
> from one database on the west coast to another on the east coast.
>
> This has been stumping me for a bit - any ideas?
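
The port conflict described in this thread (a stale CouchDB process still
holding the listening port, so the restarted process fails with eaddrinuse)
can be checked for before a restart is attempted. Below is a minimal sketch
in Python, assuming CouchDB listens on its default HTTP port 5984 on
localhost. The helper name port_in_use is hypothetical; it is not part of
CouchDB, its init script, or the gist linked above.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already accepting connections on host:port.

    Hypothetical helper for illustration only.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the TCP connection succeeds,
        # which means a listener still holds the port.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Assumption: 5984 is the port in use; adjust if local.ini says otherwise.
    if port_in_use(5984):
        print("port 5984 is still held; the old process has not exited")
    else:
        print("port 5984 is free")
```

A check like this only reports the symptom. It does not say which process
holds the port, and it does not replace fixing the respawn behaviour in the
startup script.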