couchdb-user mailing list archives

From Jérôme Augé <jerome.a...@anakeen.com>
Subject Re: High memory consumption of a single node CouchDB server
Date Mon, 08 Jul 2019 07:02:25 GMT
Hi,

Here is an update on our high memory usage problem.

The client has not yet given the green light for the upgrade to CouchDB
2.3.1, so we are still on 2.1.1.

However, in the meantime, we changed the compaction daemon's default
parameters, and we seem to be seeing some positive effects.

The change (we had been running with the default compaction settings) was
to lower the fragmentation thresholds and to specify an explicit time
window for compaction:

--8<--
[compactions]
_default = [{db_fragmentation, "30%"}, {view_fragmentation, "30%"}, {from,
"03:00"}, {to, "04:55"}]
-->8--
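
For reference, here is a minimal sketch of how the fragmentation figure behind these thresholds can be estimated from a database's info document (`GET /{db}`). It assumes the definition given in the compaction daemon documentation, (file_size - data_size) / file_size * 100, and that the info document exposes `sizes.file` and `sizes.active`. The function names and the sample numbers are made up for illustration; this is not the script we actually run.

```python
def fragmentation_percent(db_info):
    """Estimate fragmentation (%) from a parsed `GET /{db}` response.

    Assumes the response carries a "sizes" object with "file" (bytes on
    disk) and "active" (bytes of live data).
    """
    sizes = db_info["sizes"]
    file_size = sizes["file"]
    if file_size <= 0:
        return 0.0
    return (file_size - sizes["active"]) * 100.0 / file_size


def needs_compaction(db_info, threshold_percent=30.0):
    """True when fragmentation reaches the threshold (30% in our config)."""
    return fragmentation_percent(db_info) >= threshold_percent


# Hypothetical numbers, not taken from our server.
sample = {"db_name": "example", "sizes": {"file": 1000, "active": 250, "external": 400}}
print(fragmentation_percent(sample))  # 75.0
print(needs_compaction(sample))       # True
```

Running this over the list returned by `/_all_dbs` gives a rough count of how many databases sit above a given threshold.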

We did not restart the daemon immediately; we just waited for it to get
OOM-killed and come back up with the new configuration (which occurred at
noon on Wednesday July 3).

The following morning (at 3 AM on Thursday July 4) we saw a "dramatic" decrease
in disk usage from 60 GB down to 25-30 GB!

Since then, the `beam.smp` process seems to take no more than 2 GB of RSS
memory, and it no longer seems to show increasing spikes in memory usage.

We'll keep monitoring it in the following days.
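
As an illustration of what we look at in each logged snapshot, here is a small sketch that pulls a few indicators out of a parsed `/_node/_local/_system` response. The `summarize_system` name and the `queue_alert` threshold are hypothetical, and the `message_queues` values in the sample are invented; only the `memory.processes` and `run_queue` values are taken from earlier in this thread.

```python
def summarize_system(snapshot, queue_alert=1000):
    """Pick out a few health indicators from a parsed `_system` response.

    `queue_alert` is an arbitrary illustrative threshold, not a CouchDB
    setting. Message queue entries may be plain integers or objects, so
    only integer entries are compared here.
    """
    memory = snapshot.get("memory", {})
    queues = snapshot.get("message_queues", {})
    backlog = {
        name: length
        for name, length in queues.items()
        if isinstance(length, int) and length >= queue_alert
    }
    return {
        "processes_gib": memory.get("processes", 0) / 2**30,
        "run_queue": snapshot.get("run_queue", 0),
        "backlogged_queues": backlog,
    }


# memory.processes and run_queue come from the thread; the queue lengths
# are invented for the example.
snapshot = {
    "memory": {"processes": 1877591424},
    "run_queue": 0,
    "message_queues": {"couch_server": 25000, "couch_file": {"count": 10}},
}
summary = summarize_system(snapshot)
print(round(summary["processes_gib"], 2), summary["backlogged_queues"])
```

A growing `couch_server` entry or a steadily rising `processes_gib` from one minute to the next is what we would treat as the early warning.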

Regards,
Jérôme


On Thu, Jun 20, 2019 at 17:07, Jérôme Augé <jerome.auge@anakeen.com> wrote:

> We are going to plan an upgrade from 2.1.1 to 2.3.1 in the coming weeks.
>
> I have a side question concerning CouchDB upgrades: is the database
> binary-compatible between v2.1.1 and v2.3.1? If we ever need to
> downgrade back to 2.1.1, can the binary data be kept?
>
> Regards,
> Jérôme
>
> On Wed, Jun 19, 2019 at 08:59, Jérôme Augé <jerome.auge@anakeen.com> wrote:
>
>> Thanks Adam for your explanations!
>>
>> The "update_lru_on_read" is already set to false on this instance (I had
>> already seen the comments on these pull-requests).
>>
>> We are indeed running an "old" 2.1.1 version, and we have advised
>> the client that an upgrade might be needed to sort out (or further
>> investigate) these problems.
>>
>> Thanks again,
>> Jérôme
>>
>>
>>
>> On Tue, Jun 18, 2019 at 18:59, Adam Kocoloski <kocolosk@apache.org> wrote:
>>
>>> Hi Jérôme, definitely useful.
>>>
>>> The “run_queue” is the number of Erlang processes in a runnable state
>>> that are not currently executing on a scheduler. When that value is greater
>>> than zero it means the node is hitting some compute limitations. Seeing a
>>> small positive value from time to time is no problem.
>>>
>>> Your last six snapshots show a message queue backlog in couch_server.
>>> That could be what caused the node to OOM. The couch_server process is a
>>> singleton and if it accumulates a large message backlog there are limited
>>> backpressure or scaling mechanisms to help it recover. I noticed you’re
>>> running 2.1.1; there were a couple of important enhancements to reduce the
>>> message flow through couch_server in more recent releases:
>>>
>>> 2.2.0: https://github.com/apache/couchdb/pull/1118
>>> 2.3.1: https://github.com/apache/couchdb/pull/1593
>>>
>>> The change in 2.2.0 is just a change in the default configuration; you
>>> can try applying it to your server by setting:
>>>
>>> [couchdb]
>>> update_lru_on_read = false
>>>
>>> The changes in 2.3.1 offer additional benefits for couch_server message
>>> throughput but you’ll need to upgrade to get them.
>>>
>>> Cheers, Adam
>>>
>>> P.S. I don’t know what’s going on with the negative memory.other value
>>> there, it’s not intentionally meaningful :)
>>>
>>>
>>> > On Jun 18, 2019, at 11:30 AM, Jérôme Augé <jerome.auge@anakeen.com> wrote:
>>> >
>>> > "beam.smp" just got killed by the OOM killer, but I was not in front of
>>> > the machine to run this command...
>>> >
>>> > However, here is the CouchDB log of "/_node/_local/_system" for the 30
>>> > minutes preceding the OOM:
>>> > - https://gist.github.com/eguaj/1fba3eda4667a999fa691ff1902f04fc#file-log-couchdb-system-2019-06-18-log
>>> >
>>> > I guess the spike that triggers the OOM is so quick (< 1 min) that it
>>> > does not get logged (I log every minute).
>>> >
>>> > Is there anything that can be used/deduced from the last line logged at
>>> > 2019-06-18T16:00:14+0200?
>>> >
>>> > At 15:55:25, the "run_queue" is at 36: what does it mean? The number of
>>> > active concurrent requests?
>>> >
>>> > From 15:56 to 16:00 the "memory"."other" value is negative: does it
>>> > mean something special, or is it just an integer overflow?
>>> >
>>> >
>>> >
>>> > On Mon, Jun 17, 2019 at 14:09, Vladimir Ralev <vladimir.ralev@gmail.com> wrote:
>>> >
>>> >> Alright, I think the issue will be more visible towards the OOM point.
>>> >> However, for now, since you have the system live with a leak, it will be
>>> >> useful to repeat the same steps, but replace "message_queue_len" with
>>> >> "total_heap_size", then with "heap_size", then with "stack_size", and
>>> >> then with "reductions".
>>> >>
>>> >> For example:
>>> >>
>>> >> MQSizes2 = lists:map(fun(A) -> {_,B} = case process_info(A,total_heap_size)
>>> >> of {XV,XB} -> {XV, XB}; _ERR -> io:format("~p",[_ERR]),{ok, 0} end, {B,A}
>>> >> end, processes()).
>>> >>
>>> >> Then same with the other params.
>>> >>
>>> >> That can shed some light; otherwise someone will need to monitor the
>>> >> process count and go into the processes by age and memory patterns.
>>> >>
>>> >> On Mon, Jun 17, 2019 at 2:55 PM Jérôme Augé <jerome.auge@anakeen.com>
>>> >> wrote:
>>> >>
>>> >>> The 2G consumption is from Adrien's system.
>>> >>>
>>> >>> On mine, since I set up the logging of the "/_node/_local/_system"
>>> >>> output:
>>> >>> - on June 14th max memory.processes was 2.6 GB
>>> >>> - on June 15th max memory.processes was 4.7 GB
>>> >>> - on June 16th max memory.processes was 7.0 GB
>>> >>> - today (June 17th) max memory.processes was 8.0 GB (and with an
>>> >>> interactive top I see spikes at 12 GB)
>>> >>>
>>> >>> The memory.processes value seems to be steadily increasing over the
>>> >>> days, and I expect the out-of-memory condition to be triggered in a
>>> >>> couple of days.
>>> >>>
>>> >>> On Mon, Jun 17, 2019 at 11:53, Vladimir Ralev <vladimir.ralev@gmail.com> wrote:
>>> >>>
>>> >>>> Nothing to see here, the message queue stat from Adam's advice is
>>> >>>> accurate. Note that you should run this only when there is already an
>>> >>>> unreasonable amount of memory leaked/consumed.
>>> >>>>
>>> >>>> But now I realise you had "processes":1877591424 before restart in the
>>> >>>> stats above, which is less than 2G. Are you using only 2 gigs of RAM? I
>>> >>>> got confused by the initial comment and I thought you had 15GB RAM. If
>>> >>>> you are only using 2 gigs of RAM, it's probably not enough for your
>>> >>>> workload.
>>> >>>>
>>> >>>> On Mon, Jun 17, 2019 at 12:15 PM Jérôme Augé <jerome.auge@anakeen.com>
>>> >>>> wrote:
>>> >>>>
>>> >>>>> That command seems to work, and here is the output:
>>> >>>>>
>>> >>>>> --8<--
>>> >>>>> # /opt/couchdb/bin/remsh < debug.2.remsh
>>> >>>>> Eshell V7.3  (abort with ^G)
>>> >>>>> (remsh22574@127.0.0.1)1> [{0,<0.0.0>},
>>> >>>>> {0,<0.3.0>},
>>> >>>>> {0,<0.6.0>},
>>> >>>>> {0,<0.7.0>},
>>> >>>>> {0,<0.9.0>},
>>> >>>>> {0,<0.10.0>},
>>> >>>>> {0,<0.11.0>},
>>> >>>>> {0,<0.12.0>},
>>> >>>>> {0,<0.14.0>},
>>> >>>>> {0,<0.15.0>},
>>> >>>>> {0,<0.16.0>},
>>> >>>>> {0,<0.17.0>},
>>> >>>>> {0,<0.18.0>},
>>> >>>>> {0,<0.19.0>},
>>> >>>>> {0,<0.20.0>},
>>> >>>>> {0,<0.21.0>},
>>> >>>>> {0,<0.22.0>},
>>> >>>>> {0,<0.23.0>},
>>> >>>>> {0,<0.24.0>},
>>> >>>>> {0,<0.25.0>},
>>> >>>>> {0,<0.26.0>},
>>> >>>>> {0,<0.27.0>},
>>> >>>>> {0,<0.28.0>},
>>> >>>>> {0,<0.29.0>},
>>> >>>>> {0,<0.31.0>},
>>> >>>>> {0,<0.32.0>},
>>> >>>>> {0,<0.33.0>},
>>> >>>>> {0,...},
>>> >>>>> {...}]
>>> >>>>> (remsh22574@127.0.0.1)2> {0,<0.38.0>}
>>> >>>>> (remsh22574@127.0.0.1)3> [{current_function,{erl_eval,do_apply,6}},
>>> >>>>> {initial_call,{erlang,apply,2}},
>>> >>>>> {status,running},
>>> >>>>> {message_queue_len,0},
>>> >>>>> {messages,[]},
>>> >>>>> {links,[<0.32.0>]},
>>> >>>>> {dictionary,[]},
>>> >>>>> {trap_exit,false},
>>> >>>>> {error_handler,error_handler},
>>> >>>>> {priority,normal},
>>> >>>>> {group_leader,<0.31.0>},
>>> >>>>> {total_heap_size,5172},
>>> >>>>> {heap_size,2586},
>>> >>>>> {stack_size,24},
>>> >>>>> {reductions,24496},
>>> >>>>> {garbage_collection,[{min_bin_vheap_size,46422},
>>> >>>>>                      {min_heap_size,233},
>>> >>>>>                      {fullsweep_after,65535},
>>> >>>>>                      {minor_gcs,1}]},
>>> >>>>> {suspending,[]}]
>>> >>>>> (remsh22574@127.0.0.1)4> *** Terminating erlang ('remsh22574@127.0.0.1')
>>> >>>>> -->8--
>>> >>>>>
>>> >>>>> What should I be looking for in this output?
>>> >>>>>
>>> >>>>> On Fri, Jun 14, 2019 at 17:30, Vladimir Ralev <vladimir.ralev@gmail.com> wrote:
>>> >>>>>
>>> >>>>>> That means your couch is creating and destroying processes too
>>> >>>>>> rapidly. I haven't seen this; however, I think Adam's message_queues
>>> >>>>>> stat above does the same thing. I didn't notice you can get it from
>>> >>>>>> there.
>>> >>>>>>
>>> >>>>>> Either way, it will be useful if you can get the shell to work.
>>> >>>>>> Try this command instead for the first one; the rest will be the same:
>>> >>>>>>
>>> >>>>>> MQSizes2 = lists:map(fun(A) -> {_,B} = case
>>> >>>>>> process_info(A,message_queue_len) of {XV,XB} -> {XV, XB}; _ERR ->
>>> >>>>>> io:format("~p",[_ERR]),{ok, 0} end, {B,A} end, processes()).
>>> >>>>>>
>>> >>>>>> On Fri, Jun 14, 2019 at 5:52 PM Jérôme Augé <jerome.auge@anakeen.com>
>>> >>>>>> wrote:
>>> >>>>>>
>>> >>>>>>> I tried the following, but it seems to fail on the first command:
>>> >>>>>>>
>>> >>>>>>> --8<--
>>> >>>>>>> # /opt/couchdb/bin/remsh
>>> >>>>>>> Erlang/OTP 18 [erts-7.3] [source-d2a6d81] [64-bit] [smp:8:8]
>>> >>>>>>> [async-threads:10] [hipe] [kernel-poll:false]
>>> >>>>>>>
>>> >>>>>>> Eshell V7.3  (abort with ^G)
>>> >>>>>>> (couchdb@127.0.0.1)1> MQSizes2 = lists:map(fun(A) -> {_,B} =
>>> >>>>>>> process_info(A,message_queue_len), {B,A} end, processes()).
>>> >>>>>>> ** exception error: no match of right hand side value undefined
>>> >>>>>>> -->8--
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Fri, Jun 14, 2019 at 16:08, Vladimir Ralev <vladimir.ralev@gmail.com> wrote:
>>> >>>>>>>
>>> >>>>>>>> Hey guys. I bet it's a mailbox leaking memory. I am very interested
>>> >>>>>>>> in debugging issues like this too.
>>> >>>>>>>>
>>> >>>>>>>> I suggest getting an Erlang shell and running these commands to see
>>> >>>>>>>> the top memory-consuming processes:
>>> >>>>>>>> https://www.mail-archive.com/user@couchdb.apache.org/msg29365.html
>>> >>>>>>>>
>>> >>>>>>>> One issue I will be reporting soon is that if one of your nodes is
>>> >>>>>>>> down for some amount of time, it seems like all databases
>>> >>>>>>>> independently try and retry to query the missing node and fail,
>>> >>>>>>>> resulting in a lot of log output for each db, which can overwhelm
>>> >>>>>>>> the logger process. If you have a lot of DBs this makes the problem
>>> >>>>>>>> worse, but it doesn't happen right away for some reason.
>>> >>>>>>>>
>>> >>>>>>>> On Fri, Jun 14, 2019 at 4:25 PM Adrien Vergé <adrien.verge@tolteck.com>
>>> >>>>>>>> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> Hi Jérôme and Adam,
>>> >>>>>>>>>
>>> >>>>>>>>> That's funny, because I'm investigating the exact same problem these
>>> >>>>>>>>> days. We have two CouchDB setups:
>>> >>>>>>>>> - a one-node server (q=2 n=1) with 5000 databases
>>> >>>>>>>>> - a 3-node cluster (q=2 n=3) with 50000 databases
>>> >>>>>>>>>
>>> >>>>>>>>> ... and we are experiencing the problem on both setups. We've been
>>> >>>>>>>>> having this problem for at least 3-4 months.
>>> >>>>>>>>>
>>> >>>>>>>>> We've monitored:
>>> >>>>>>>>>
>>> >>>>>>>>> - The number of open files: it's relatively low (both the system's
>>> >>>>>>>>> total and the fds opened by beam.smp).
>>> >>>>>>>>>  https://framapic.org/wQUf4fLhNIm7/oa2VHZyyoPp9.png
>>> >>>>>>>>>
>>> >>>>>>>>> - The usage of RAM, total used and used by beam.smp
>>> >>>>>>>>>  https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png
>>> >>>>>>>>>  It continuously grows, with regular spikes, until killing CouchDB
>>> >>>>>>>>>  with an OOM. After restart, the RAM usage is nice and low, with no
>>> >>>>>>>>>  spikes.
>>> >>>>>>>>>
>>> >>>>>>>>> - /_node/_local/_system metrics, before and after restart. Values
>>> >>>>>>>>> that significantly differ (before / after restart) are listed here:
>>> >>>>>>>>>  - uptime (obviously ;-))
>>> >>>>>>>>>  - memory.processes : + 3732 %
>>> >>>>>>>>>  - memory.processes_used : + 3735 %
>>> >>>>>>>>>  - memory.binary : + 17700 %
>>> >>>>>>>>>  - context_switches : + 17376 %
>>> >>>>>>>>>  - reductions : + 867832 %
>>> >>>>>>>>>  - garbage_collection_count : + 448248 %
>>> >>>>>>>>>  - words_reclaimed : + 112755 %
>>> >>>>>>>>>  - io_input : + 44226 %
>>> >>>>>>>>>  - io_output : + 157951 %
>>> >>>>>>>>>
>>> >>>>>>>>> Before CouchDB restart:
>>> >>>>>>>>> {
>>> >>>>>>>>>  "uptime":2712973,
>>> >>>>>>>>>  "memory":{
>>> >>>>>>>>>    "other":7250289,
>>> >>>>>>>>>    "atom":512625,
>>> >>>>>>>>>    "atom_used":510002,
>>> >>>>>>>>>    "processes":1877591424,
>>> >>>>>>>>>    "processes_used":1877504920,
>>> >>>>>>>>>    "binary":177468848,
>>> >>>>>>>>>    "code":9653286,
>>> >>>>>>>>>    "ets":16012736
>>> >>>>>>>>>  },
>>> >>>>>>>>>  "run_queue":0,
>>> >>>>>>>>>  "ets_table_count":102,
>>> >>>>>>>>>  "context_switches":1621495509,
>>> >>>>>>>>>  "reductions":968705947589,
>>> >>>>>>>>>  "garbage_collection_count":331826928,
>>> >>>>>>>>>  "words_reclaimed":269964293572,
>>> >>>>>>>>>  "io_input":8812455,
>>> >>>>>>>>>  "io_output":20733066,
>>> >>>>>>>>>  ...
>>> >>>>>>>>>
>>> >>>>>>>>> After CouchDB restart:
>>> >>>>>>>>> {
>>> >>>>>>>>>  "uptime":206,
>>> >>>>>>>>>  "memory":{
>>> >>>>>>>>>    "other":6907493,
>>> >>>>>>>>>    "atom":512625,
>>> >>>>>>>>>    "atom_used":497769,
>>> >>>>>>>>>    "processes":49001944,
>>> >>>>>>>>>    "processes_used":48963168,
>>> >>>>>>>>>    "binary":997032,
>>> >>>>>>>>>    "code":9233842,
>>> >>>>>>>>>    "ets":4779576
>>> >>>>>>>>>  },
>>> >>>>>>>>>  "run_queue":0,
>>> >>>>>>>>>  "ets_table_count":102,
>>> >>>>>>>>>  "context_switches":1015486,
>>> >>>>>>>>>  "reductions":111610788,
>>> >>>>>>>>>  "garbage_collection_count":74011,
>>> >>>>>>>>>  "words_reclaimed":239214127,
>>> >>>>>>>>>  "io_input":19881,
>>> >>>>>>>>>  "io_output":13118,
>>> >>>>>>>>>  ...
>>> >>>>>>>>>
>>> >>>>>>>>> Adrien
>>> >>>>>>>>>
>>> >>>>>>>>> On Fri, Jun 14, 2019 at 15:11, Jérôme Augé <jerome.auge@anakeen.com> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>> Ok, so I'll set up a cron job to log (every minute?) the output
>>> >>>>>>>>>> from "/_node/_local/_system" and wait for the next OOM kill.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Any property from "_system" to look for in particular?
>>> >>>>>>>>>>
>>> >>>>>>>>>> Here is a link to the memory usage graph:
>>> >>>>>>>>>> https://framapic.org/IzcD4Y404hlr/06rm0Ji4TpKu.png
>>> >>>>>>>>>>
>>> >>>>>>>>>> The memory usage varies, but the general trend is to go up with
>>> >>>>>>>>>> some regularity over a week until we reach OOM. When "beam.smp" is
>>> >>>>>>>>>> killed, it's reported as consuming 15 GB (as seen in the kernel's
>>> >>>>>>>>>> OOM trace in syslog).
>>> >>>>>>>>>>
>>> >>>>>>>>>> Thanks,
>>> >>>>>>>>>> Jérôme
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Fri, Jun 14, 2019 at 13:48, Adam Kocoloski <kocolosk@apache.org> wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>>> Hi Jérôme,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Thanks for a well-written and detailed report (though the mailing
>>> >>>>>>>>>>> list strips attachments). The _system endpoint provides a lot of
>>> >>>>>>>>>>> useful data for debugging these kinds of situations; do you have a
>>> >>>>>>>>>>> snapshot of the output when the system was consuming a lot of
>>> >>>>>>>>>>> memory?
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> http://docs.couchdb.org/en/stable/api/server/common.html#node-node-name-system
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Adam
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> On Jun 14, 2019, at 5:44 AM, Jérôme Augé <jerome.auge@anakeen.com>
>>> >>>>>>>>>>>> wrote:
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Hi,
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> I'm having a hard time figuring out the high memory usage of a
>>> >>>>>>>>>>>> CouchDB server.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> What I'm observing is that the memory consumption of the "beam.smp"
>>> >>>>>>>>>>>> process gradually rises until it triggers the kernel's OOM
>>> >>>>>>>>>>>> (Out-Of-Memory) killer, which kills the "beam.smp" process.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> It also seems that many databases are not compacted: I've made a
>>> >>>>>>>>>>>> script to iterate over the databases and compute the fragmentation
>>> >>>>>>>>>>>> factor, and it seems I have around 2100 databases with a frag > 70%.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> We have a single CouchDB v2.1.1 server (configured with q=8 n=1)
>>> >>>>>>>>>>>> and around 2770 databases.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> The server initially had 4 GB of RAM, and we are now at 16 GB with
>>> >>>>>>>>>>>> 8 vCPU, and it still regularly reaches OOM. From the monitoring I
>>> >>>>>>>>>>>> see that with 16 GB the OOM is triggered almost once per week (cf.
>>> >>>>>>>>>>>> attached graph).
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> The memory usage seems to increase gradually until it reaches OOM.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> The Couch server is mostly used by web clients with the PouchDB JS
>>> >>>>>>>>>>>> API.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> We have ~1300 distinct users, and by monitoring the netstat/TCP
>>> >>>>>>>>>>>> established connections I guess we have around 100 (maximum) users
>>> >>>>>>>>>>>> at any given time. From what I understand of the application's
>>> >>>>>>>>>>>> logic, each user accesses 2 private databases (read/write) + 1
>>> >>>>>>>>>>>> common database (read-only).
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> On-disk usage of CouchDB's data directory is around 40 GB.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Any ideas on what could cause such behavior (increasing memory
>>> >>>>>>>>>>>> usage over the course of a week)? Or how to find out what is
>>> >>>>>>>>>>>> happening behind the scenes?
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Regards,
>>> >>>>>>>>>>>> Jérôme
