couchdb-user mailing list archives

From Sinan Gabel <sinan.ga...@gmail.com>
Subject Re: 100% CPU on only a single node because of couchjs processes
Date Mon, 04 Dec 2017 21:08:39 GMT
Hi,

I am also experiencing 100% CPU usage and am not sure why. It starts
suddenly and continues until CouchDB is restarted.
My setup is a single node (n:3, q:8) running CouchDB v. 2.1.0-6c4def6
on Ubuntu 16.04 with 2 vCPUs and 4.5 GB of memory.
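
One way to put a number on this per node is to total the CPU share of the
couchjs processes. The following is a minimal Python sketch, not a tested
diagnostic tool. It assumes a Linux host where `ps -eo pcpu,comm` is
available and where the query server processes appear under the command
name `couchjs`; the function name `couchjs_cpu` is made up for illustration.

```python
import subprocess


def couchjs_cpu(ps_output: str) -> tuple[int, float]:
    """Return (process count, summed %CPU) for couchjs rows.

    Expects the text produced by `ps -eo pcpu,comm`: a header row
    followed by one "<%CPU> <COMMAND>" row per process.
    """
    count, total = 0, 0.0
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)
        # Assumption: the query server shows up with the command name "couchjs".
        if len(parts) == 2 and parts[1].strip() == "couchjs":
            count += 1
            total += float(parts[0])
    return count, total


if __name__ == "__main__":
    out = subprocess.run(
        ["ps", "-eo", "pcpu,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    n, cpu = couchjs_cpu(out)
    print(f"{n} couchjs processes using {cpu:.1f}% CPU in total")
```

Running this on each node at the same moment would show whether the couchjs
load really sits on one node only. Note that `ps` reports CPU averaged over
each process's lifetime, so the figures are a rough indication at best.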

On 4 December 2017 at 19:46, Geoffrey Cox <redgeoff@gmail.com> wrote:

> Hi,
>
> I've spent days using trial and error to try and figure out why I am
> getting a very high CPU load on only a single node in my cluster. I'm
> hoping someone has an idea of what is going on as I'm getting stuck.
>
> Here's my configuration:
>
>    1. 2 node cluster:
>       1. Each node is located in a different AWS availability zone
>       2. Each node is a t2 medium instance (2 CPU cores, 4 GB Mem)
>    2. A haproxy server is load balancing traffic to the nodes using round
>    robin
>
> The problem:
>
>    1. After users make changes via PouchDB, a backend runs a number of
>    routines that use views to calculate notifications. The issue is that
>    on a single node, the couchjs processes stack up and then start to
>    consume nearly all the available CPU. This server then becomes the
>    "workhorse" that always does *all* the heavy duty couchjs processing
>    until I restart this node.
>    2. It is important to note that both nodes have couchjs processes, but
>    it is only a single node that has the couchjs processes that are using
>    100% CPU
>    3. I've even resorted to setting `os_process_limit = 10` and this just
>    results in the couchjs processes taking over 10% each! In other words,
>    the couchjs processes just eat up all the CPU no matter how many
>    couchjs processes there are!
>    4. The CPU usage will eventually clear after all the processing is done,
>    but then as soon as there is more to process the workhorse node will get
>    bogged down again.
>    5. If I restart the workhorse node, the other node then becomes the
>    workhorse node. This is the only way to get the couchjs processes to
>    "move" to another node.
>    6. The problem is that this design is not scalable as only one node can
>    be the workhorse node at any given time. Moreover this causes specific
>    instances to run out of CPU credits. Shouldn't the couchjs processes be
>    spread out over all my nodes? From what I can tell, if I add more nodes
>    I'm still going to have the issue where only one of the nodes is
>    getting bogged down. Is it possible that the problem is that I have 2
>    nodes and really I need at least 3 nodes? (I know a 2-node cluster is
>    not very typical)
>
>
>  Things I've checked:
>
>    1. Ensured that the load balancing is working, i.e. haproxy is indeed
>    distributing traffic accordingly
>    2. I've tried setting `os_process_limit = 10` and
>    `os_process_soft_limit = 5` to see if I could force a more conservative
>    usage of couchjs processes, but instead the couchjs processes just
>    consume all the CPU load.
>    3. I've tried simulating the issue locally with VMs and I cannot
>    duplicate any such load. My guess is that this is because the nodes are
>    located on the same box so hop distance between nodes is very small and
>    this somehow keeps the CPU usage to a minimum
>    4. I've tried isolating the issue by creating short code snippets that
>    intentionally try to spawn a lot of couchjs processes and they are
>    spawned but don't consume 100% CPU
>    5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and this
>    doesn't seem to change anything
>    6. The only error entries in my CouchDB logs are like the following and
>    I don't believe they are related to my issue:
>       1.
>
>       [error] 2017-12-04T18:13:38.728970Z couchdb@172.31.83.32 <0.13974.79>
>       4b0b21c664 rexi_server: from: couchdb@172.31.83.32(<0.20638.79>) mfa:
>       fabric_rpc:open_shard/2 throw:{forbidden,<<"You are not allowed to access this db.">>}
>       [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},
>       {fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},
>       {rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
>
> Does CouchDB have some logic built in that spawns a number of couchjs
> processes on a "primary" node? Will future view processing then always be
> routed to this "primary" node?
>
> Is there a way to better distribute these heavy duty couchjs processes? Is
> it possible to limit their CPU consumption? (I'm hesitant to start down the
> path of using something like cpulimit as I think there is a root problem
> that needs to be addressed)
>
> I'm running out of ideas and hope that someone has some notion of what is
> causing this bizarre load or if there is a bug in CouchDB.
>
> Thank you for any help you can provide!
>
> Geoff
>
