couchdb-user mailing list archives

From Geoffrey Cox <redge...@gmail.com>
Subject Re: 100% CPU on only a single node because of couchjs processes
Date Tue, 05 Dec 2017 22:48:51 GMT
Hi Adam, quick follow-up: is it possible that writes can also be directed
to a "primary" node, the way the `stale` option works for a view? I was
originally thinking the issue was with reading data via a view, but now I
suspect it may be related to writing data, with those writes somehow
triggering these persistent and heavyweight couchjs processes. It's tough
to say, as I'd imagine you don't see much couchjs load unless you have
frequent writing and reading. I'm still trying to isolate the issue, and it
is difficult because the problem only seems to happen in a production env
and only with *all* the production code, figures ;)
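
For reference, a minimal sketch of what I mean by the stale option, with
placeholder host/db/view names and assuming a global fetch (Node 18+ or a
browser); as far as I can tell, stale only applies to view reads, not writes:

    // Rough sketch only: placeholder host, database, design doc and view.
    const view = 'http://localhost:5984/mydb/_design/mydesign/_view/myview';

    // Default read: the view index is brought up to date first (couchjs work).
    fetch(view).then(r => r.json()).then(console.log);

    // stale=ok: answer from whatever is already indexed, skipping the index
    // update; per Adam's note below, these also get pinned to a "primary"
    // shard copy.
    fetch(view + '?stale=ok').then(r => r.json()).then(console.log);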

On Tue, Dec 5, 2017 at 12:13 PM Geoffrey Cox <redgeoff@gmail.com> wrote:

> Hey Adam,
>
> Attached is my local.ini and the design doc with the view JS.
>
> Please see my responses below:
>
> Thanks for the help!
>
> On Tue, Dec 5, 2017 at 8:55 AM Adam Kocoloski <kocolosk@apache.org> wrote:
>
>> Hi Geoff, a couple of additional questions:
>>
>> 1) Are you making these view requests with stale=ok or stale=update_after?
>>
> GC: I am not using the stale parameter
>
>> 2) What are you using for N and Q in the [cluster] configuration settings?
>>
> GC: As per the attached local.ini, I specified n=2 and am using the
> default q=8.
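>
> (If useful, the effective values can also be read back from a running
> node; a rough sketch, using the node name from my logs as a placeholder
> and assuming Node's global fetch and Buffer:)
>
>     // Rough sketch: read the [cluster] section via the per-node config API.
>     // Node name copied from my logs; host and credentials are placeholders.
>     const url = 'http://localhost:5984/_node/couchdb@172.31.83.32/_config/cluster';
>     const auth = 'Basic ' + Buffer.from('admin:pass').toString('base64');
>     fetch(url, { headers: { Authorization: auth } })
>       .then(r => r.json())
>       .then(cfg => console.log('n =', cfg.n, 'q =', cfg.q));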
>
>> 3) Did you take advantage of the (barely-documented) “zones" attribute
>> when defining cluster members?
>>
> GC: As per the attached local.ini, I have *not* specified this option.
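>
> (From what I can piece together, and Adam please correct me, using zones
> would mean adding a "zone" field to each node document in the node-local
> _nodes database plus a matching [cluster] placement rule. A rough sketch
> with hypothetical node and zone names, assuming the node-local interface
> on port 5986 and a global fetch, run inside an async function:)
>
>     // Rough sketch of the barely-documented zones feature.
>     const url = 'http://localhost:5986/_nodes/couchdb@node1.example.com';
>     const doc = await fetch(url).then(r => r.json());      // current doc + _rev
>     await fetch(url, {
>       method: 'PUT',
>       headers: { 'Content-Type': 'application/json' },
>       body: JSON.stringify({ ...doc, zone: 'us-east-1a' }) // tag the node
>     });
>     // and then, in local.ini on each node:
>     //   [cluster]
>     //   placement = us-east-1a:1,us-east-1b:1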
>
>> 4) Do you have any other JS code besides the view definitions?
>>
> GC: When you refer to JS code, I think you mean JS code "in" CouchDB; if
> that is the case, then my only JS code is very simple views like those in
> the attached view.json. (I know I really need to break out the views so
> that there is one view per design doc, but I haven't quite gotten around
> to refactoring this, and I don't believe it is causing the CPU usage.)
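>
> (For illustration, the refactor would look roughly like this: one design
> doc per view, with hypothetical names, since as I understand it all the
> views in a single design doc share one index and get rebuilt together by
> the same couchjs process:)
>
>     // Hypothetical example of one design doc per view.
>     const byAuthor = {
>       _id: '_design/by_author',
>       views: {
>         by_author: {
>           map: "function (doc) { if (doc.author) { emit(doc.author, null); } }"
>         }
>       }
>     };
>     // e.g. fetch('http://localhost:5984/mydb', { method: 'POST',
>     //   headers: { 'Content-Type': 'application/json' },
>     //   body: JSON.stringify(byAuthor) });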
>
>>
>> Regarding #1, the cluster will actually select shards differently
>> depending on the use of those query parameters. When your request
>> stipulates that you’re OK with stale results, the cluster *will* select a
>> “primary” copy in order to improve the consistency of repeated requests
>> to the same view. The algorithm for choosing those primary copies is
>> somewhat subtle, hence my question #3.
>>
>> If you’re not using stale requests I have a much harder time explaining
>> why the 100% CPU issue would migrate from node to node like that.
>>
>> Adam
>>
>> > On Dec 5, 2017, at 9:36 AM, Geoffrey Cox <redgeoff@gmail.com> wrote:
>> >
>> > Thanks for the responses, any other thoughts?
>> >
>> > FYI: I’m trying to work on a very focused test case that I can share
>> > with the Dev team, but it is taking a little while to narrow down the
>> > exact cause.
>> > On Tue, Dec 5, 2017 at 4:43 AM Robert Samuel Newson <rnewson@apache.org>
>> > wrote:
>> >
>> >> Sorry to contradict you, but Cloudant deploys clusters across Amazon
>> >> AZs as standard. It's fast enough. It's cross-region that you need to
>> >> avoid.
>> >>
>> >> B.
>> >>
>> >>> On 5 Dec 2017, at 09:11, Jan Lehnardt <jan@apache.org> wrote:
>> >>>
>> >>> Heya Geoff,
>> >>>
>> >>> a CouchDB cluster is designed to run in the same data center, with
>> >>> local-area networking latencies. A cluster across AWS Availability
>> >>> Zones won’t work well, as you are seeing. If you want CouchDBs in both
>> >>> AZs, use regular replication and keep the clusters local to each AZ.
>> >>>
>> >>> Best
>> >>> Jan
>> >>> --
>> >>>
>> >>>> On 4. Dec 2017, at 19:46, Geoffrey Cox <redgeoff@gmail.com> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I've spent days of trial and error trying to figure out why I am
>> >>>> getting a very high CPU load on only a single node in my cluster. I'm
>> >>>> hoping someone has an idea of what is going on, as I'm getting stuck.
>> >>>>
>> >>>> Here's my configuration:
>> >>>>
>> >>>> 1. 2-node cluster:
>> >>>>    1. Each node is located in a different AWS availability zone
>> >>>>    2. Each node is a t2.medium instance (2 CPU cores, 4 GB RAM)
>> >>>> 2. A haproxy server is load balancing traffic to the nodes using
>> >>>> round robin
>> >>>>
>> >>>> The problem:
>> >>>>
>> >>>> 1. After users make changes via PouchDB, a backend runs a number of
>> >>>> routines that use views to calculate notifications. The issue is that
>> >>>> on a single node, the couchjs processes stack up and then start to
>> >>>> consume nearly all the available CPU. This server then becomes the
>> >>>> "workhorse" that always does *all* the heavy-duty couchjs processing
>> >>>> until I restart this node.
>> >>>> 2. It is important to note that both nodes have couchjs processes,
>> >>>> but it is only a single node whose couchjs processes are using 100%
>> >>>> CPU.
>> >>>> 3. I've even resorted to setting `os_process_limit = 10` and this
>> >>>> just results in each couchjs process taking over 10% each! In other
>> >>>> words, the couchjs processes just eat up all the CPU no matter how
>> >>>> many couchjs processes there are!
>> >>>> 4. The CPU usage will eventually clear after all the processing is
>> >>>> done, but then as soon as there is more to process, the workhorse
>> >>>> node will get bogged down again.
>> >>>> 5. If I restart the workhorse node, the other node then becomes the
>> >>>> workhorse node. This is the only way to get the couchjs processes to
>> >>>> "move" to another node.
>> >>>> 6. The problem is that this design is not scalable, as only one node
>> >>>> can be the workhorse node at any given time. Moreover, this causes
>> >>>> specific instances to run out of CPU credits. Shouldn't the couchjs
>> >>>> processes be spread out over all my nodes? From what I can tell, if I
>> >>>> add more nodes I'm still going to have the issue where only one of
>> >>>> the nodes is getting bogged down. Is it possible that the problem is
>> >>>> that I have 2 nodes and really I need at least 3 nodes? (I know a
>> >>>> 2-node cluster is not very typical.)
>> >>>>
>> >>>>
>> >>>> Things I've checked:
>> >>>>
>> >>>> 1. Ensured that the load balancing is working, i.e. haproxy is
>> >>>> indeed distributing traffic accordingly.
>> >>>> 2. I've tried setting `os_process_limit = 10` and
>> >>>> `os_process_soft_limit = 5` to see if I could force a more
>> >>>> conservative usage of couchjs processes, but instead the couchjs
>> >>>> processes just consume all the CPU anyway (a rough sketch of setting
>> >>>> these at runtime follows right after this list).
>> >>>> 3. I've tried simulating the issue locally with VMs and I cannot
>> >>>> duplicate any such load. My guess is that this is because the nodes
>> >>>> are located on the same box, so the hop distance between nodes is
>> >>>> very small and this somehow keeps the CPU usage to a minimum.
>> >>>> 4. I've tried isolating the issue by creating short code snippets
>> >>>> that intentionally try to spawn a lot of couchjs processes; they are
>> >>>> spawned but don't consume 100% CPU.
>> >>>> 5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and this
>> >>>> doesn't seem to change anything.
>> >>>> 6. The only error entries in my CouchDB logs are like the following,
>> >>>> and I don't believe they are related to my issue:
>> >>>>
>> >>>>    [error] 2017-12-04T18:13:38.728970Z couchdb@172.31.83.32
>> >>>>    <0.13974.79> 4b0b21c664 rexi_server: from:
>> >>>>    couchdb@172.31.83.32(<0.20638.79>) mfa: fabric_rpc:open_shard/2
>> >>>>    throw:{forbidden,<<"You are not allowed to access this db.">>}
>> >>>>    [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},
>> >>>>    {fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},
>> >>>>    {rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
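>> >>>>
>> >>>> (Sketch, for reference and not necessarily how I set them: both
>> >>>> limits live in the [query_server_config] section and can also be
>> >>>> changed at runtime through the per-node config API. Node name taken
>> >>>> from my logs; host and credentials are placeholders.)
>> >>>>
>> >>>>    const url = 'http://localhost:5984/_node/couchdb@172.31.83.32' +
>> >>>>                '/_config/query_server_config/os_process_limit';
>> >>>>    fetch(url, {
>> >>>>      method: 'PUT',
>> >>>>      headers: {
>> >>>>        'Content-Type': 'application/json',
>> >>>>        Authorization: 'Basic ' + Buffer.from('admin:pass').toString('base64')
>> >>>>      },
>> >>>>      body: JSON.stringify('10')   // config values are JSON strings
>> >>>>    }).then(r => r.json()).then(prev => console.log('previous value:', prev));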
>> >>>>
>> >>>> Does CouchDB have some logic built in that spawns a number of couchjs
>> >>>> processes on a "primary" node? Will future view processing then
>> >>>> always be routed to this "primary" node?
>> >>>>
>> >>>> Is there a way to better distribute these heavy-duty couchjs
>> >>>> processes? Is it possible to limit their CPU consumption? (I'm
>> >>>> hesitant to start down the path of using something like cpulimit, as
>> >>>> I think there is a root problem that needs to be addressed.)
>> >>>>
>> >>>> I'm running out of ideas and hope that someone has some notion of
>> >>>> what is causing this bizarre load, or whether there is a bug in
>> >>>> CouchDB.
>> >>>>
>> >>>> Thank you for any help you can provide!
>> >>>>
>> >>>> Geoff
>> >>>
>> >>> --
>> >>> Professional Support for Apache CouchDB:
>> >>> https://neighbourhood.ie/couchdb-support/
>> >>>
>> >>
>> >>
>>
>>
