couchdb-user mailing list archives

From Adam Kocoloski <kocol...@apache.org>
Subject Re: CouchDB load spike (even with low traffic)?
Date Wed, 30 Apr 2014 10:40:18 GMT
Thanks Mike! I filed https://issues.apache.org/jira/browse/COUCHDB-2231 and linked your gist
in there as a possible solution.

Adam

On Apr 30, 2014, at 1:12 AM, Mike Marino <mmarino@gmail.com> wrote:

> Hi Marty,
> 
> It's difficult for me to tell the reason that couchdb is not stopping using
> your init script, but we had a similar issue that I fixed by patching the
> couchdb startup script ("executable").  The issue was that the 'shepherd'
> program was respawning couch after a requested shutdown.
> 
> This was discussed a while ago on the list and I sent our fix
> out, but I don't think it was ever integrated.  Anyways, here's the gist
> (for 1.3, though I think the file has remained the same in the newer
> versions):
> 
> https://gist.github.com/7601778
> 
> Cheers,
> Mike
> 
> On 30.04.2014 at 06:52, Marty Hu <marty.hu@gmail.com> wrote:
> 
> Okay, after doing a bit more work this is what I found out:
> 
> 1. When I start couchdb on a fresh server, it appears to run correctly.
> 
> 2. However, the conventional "sudo service couchdb stop" does not actually
> stop couchdb correctly. I know this because the couchdb processes are still
> there afterwards and I can kill them with ps -U couchdb -o pid= | xargs kill -9
> 
> 3. We use chef for configuration, so at a set interval it will queue up a
> "sudo service couchdb restart", which will try to stop the process (the
> process won't stop) and then start a new process (this process will
> actually try to start). However, the second process will not be able to
> bind to the port (the first process never got killed and still holds it) so
> will throw the error.
> 
> I imagine that this is a configuration issue (and so not really the fault of
> you guys), but I'd welcome any tips about how to deal with this short of
> changing the init script to be a messy killer.
> 
> 
> On Tue, Apr 29, 2014 at 6:54 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
> 
> Hi Marty, the mailing list stripped out the attachments except for
> spike.txt.
> 
> I don't know if they're the cause of the load spikes that you see, but the
> eaddrinuse errors are not normal. They can be caused by another process
> listening on the same port as CouchDB. Fairly peculiar stuff.
> 
> The timeout trying to open the splits-v0.1.7 at 21:23 does line up with
> your report that the system was heavily loaded at the time, but there's
> really not too much to go on here.
> 
> Regards, Adam
> 
> On Apr 29, 2014, at 7:46 PM, Marty Hu <marty.hu@gmail.com> wrote:
> 
> Thanks for the follow-up.
> 
> I've attached nagios graphs (load, disk, and ping) of one such event,
> which occurred at 2:24pm (after the drop in disk) according to my nagios
> emails. I've also attached database logs (with some client-specific queries
> removed). The error was fixed around 2:30pm. Notably, the log files are in
> GMT.
> 
> Unfortunately I don't have any graphs for the event other than what's on
> nagios.
> 
> Are the connection errors with CouchDB normal? We get them continuously
> (around every minute) even during normal operation with the DB not crashing.
> 
> On Tue, Apr 29, 2014 at 2:34 AM, Alexander Shorin <kxepal@gmail.com> wrote:
> 
> Hi Marty,
> 
> thanks for following up! I see your problem, but here is what we would need:
> 
> 1. CouchDB stats graphs and your system disk, network and memory ones.
> If you cannot share them in public, feel free to send them to me in private.
> We need to know whether they are related. For instance, high memory usage may
> be caused by uploading a large number of big files: you'll easily notice
> that by comparing the CouchDB, network and memory graphs for the spike
> period.
> 
> 2. CouchDB log entries for the spike event. Graphs can only show you
> that something is going wrong, and we could only guess (we'd probably guess
> right, but without much precision) what exactly is going wrong. Logs will
> help us find the actual requests that cause the memory spike.
> 
> After that we can start to think about the problem. For instance, if the
> spikes happen due to large attachment uploads, there is not much
> to do. On the other hand, the query server may easily eat quite a big chunk of
> memory. We'll easily notice that by monitoring the /_active_tasks resource
> (if the problem is in views) or by looking through the logs for the spike
> period. And this case can be fixed.
> 
> Not sure which tools you're using for monitoring and drawing graphs,
> but take a look at the following projects:
> - https://github.com/gws/munin-plugin-couchdb - Munin plugin for
> CouchDB monitoring. Unfortunately, it doesn't handle system metrics for
> the CouchDB process - I'll only add this during this week, but make sure
> you have a similar plugin for your monitoring system.
> - https://github.com/etsy/skyline - anomaly detector. spikes are so
> - https://github.com/etsy/oculus - metrics correlation tool. it would
> be very easy to compare multiple graphs for the anomaly period with
> it.
> 
> --
> ,,,^..^,,,
> 
> On Tue, Apr 29, 2014 at 8:15 AM, Marty Hu <marty.hu@gmail.com> wrote:
> 
> We've been running CouchDB v1.5.0 on AWS and it's been working fine.
> Recently AWS came out with new prices for their new m3 instances so we
> switched our CouchDB instance to use an m3.large. We have a relatively
> small database with < 10GB of data in it.
> 
> Our steady state metrics for it are system loads of 0.2 and memory usages
> of 5% or so. However, we noticed that every few hours (3-4 times per day)
> we get a huge spike that pushes our load to 1.5 or so and memory usage to
> close to 100%.
> 
> We don't run any cronjobs that involve the database and our traffic flow
> is about the same over the day. We do run a continuous replication from one
> database on the west coast to another on the east coast.
> 
> This has been stumping me for a bit - any ideas?
> 
> <spike.txt>
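
The eaddrinuse errors discussed in this thread appear when a second CouchDB process tries to bind a port that an earlier, still-running process holds. As a rough sketch of how one might check for that condition before a restart, the Python snippet below tries to bind the port itself and reports whether the operating system answers with EADDRINUSE. The function name `port_is_taken` is hypothetical, and 127.0.0.1:5984 (CouchDB's default address and port) is an assumption about the setup; neither comes from the thread.

```python
import errno
import socket


def port_is_taken(host: str, port: int) -> bool:
    """Return True if binding host:port fails with EADDRINUSE.

    Hypothetical helper for illustration only. It attempts the same
    kind of bind that a newly started server would attempt.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            # Another process (for example an old instance that was
            # never stopped) still holds the port.
            return True
        # Any other failure (e.g. permission denied) is not what we
        # are testing for, so let it propagate.
        raise
    finally:
        sock.close()
    return False
```

Under these assumptions, calling `port_is_taken("127.0.0.1", 5984)` right after `sudo service couchdb stop` would be expected to return `True` if the old process survived the stop, which is the situation Marty describes in point 3.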

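
Alexander's suggestion to watch the /_active_tasks resource during a spike could be sketched roughly as follows. This is only an illustration: the helper names `fetch_active_tasks` and `count_tasks_by_type` are hypothetical, the base URL assumes CouchDB on its default local address, and the request is sent without credentials, which will only work if the server does not require admin authentication for this endpoint.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_active_tasks(base_url: str = "http://127.0.0.1:5984") -> list:
    """Fetch the JSON array that CouchDB returns for GET /_active_tasks.

    The base URL is an assumed default; adjust it (and add credentials)
    for a real deployment.
    """
    with urlopen(base_url + "/_active_tasks", timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def count_tasks_by_type(tasks: list) -> Counter:
    """Count running tasks per "type" field (e.g. indexer, replication)."""
    return Counter(task.get("type", "unknown") for task in tasks)
```

Logging `count_tasks_by_type(fetch_active_tasks())` once a minute alongside the load graphs might show whether view indexing, compaction or replication tasks are running at the moment a spike starts.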
