couchdb-dev mailing list archives

From "Dave Cottlehuber (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (COUCHDB-1946) Trying to replicate NPM grinds to a halt after 40GB
Date Wed, 04 Dec 2013 23:28:35 GMT

    [ https://issues.apache.org/jira/browse/COUCHDB-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839477#comment-13839477 ]

Dave Cottlehuber commented on COUCHDB-1946:
-------------------------------------------

hey folks,

Something does happen around 30-40GB; I have reproduced the situation here.

In both your issues, the final error is the termination of the couchjs process (used for
map/reduce and other JavaScript work) by the OOM killer -- exit status 137 is 128 + 9,
i.e. the kernel sent SIGKILL. You'll see entries like the one below, but that's not the root cause.

{code}
[Tue, 03 Dec 2013 01:47:46 GMT] [error] [<0.368.0>] OS Process died with status: 137
[Tue, 03 Dec 2013 01:47:47 GMT] [error] [<0.368.0>] ** Generic server <0.368.0>
terminating 
** Last message in was {#Port<0.2771>,{exit_status,137}}
** When Server state == {os_proc,"/usr/bin/couchjs /usr/share/couchdb/server/main.js",
                                 #Port<0.2771>,
                                 #Fun<couch_os_process.2.132569728>,
                                 #Fun<couch_os_process.3.35601548>,5000}
{code}

This is the issue you're hitting, and at a minimum you need to ensure ulimits etc.
are set appropriately for the couchdb user:

    http://wiki.apache.org/couchdb/Performance
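
A quick way to check what the couchdb user actually gets (a sketch; the limits
shown are illustrative, not recommendations -- see the wiki page above for real
tuning guidance):

{code}
# check the open-file limit the couchdb user actually runs with
sudo -u couchdb sh -c 'ulimit -n'

# to raise it persistently, add entries like these (illustrative values)
# to /etc/security/limits.conf:
#   couchdb  soft  nofile  65536
#   couchdb  hard  nofile  65536
{code}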

You can optionally force the OOM killer off for the beam.smp process directly:

{code}
# assuming you run only one erlang VM at a time
echo '-1000' > /proc/`pgrep beam.smp`/oom_score_adj
{code}
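
Note that oom_score_adj is set on the running process, so it resets whenever
beam.smp restarts; you'd need to reapply it (e.g. from an init script) after
every CouchDB restart.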

It can also help to increase os_process_timeout, although from what I see
that's not what you're hitting here:

{code}
 curl -XPUT https://admin:passwd@localhost:5984/_config/couchdb/os_process_timeout -d '"60000"'
{code}
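
You can read the value back to confirm it took effect (same _config endpoint,
GET instead of PUT):

{code}
curl https://admin:passwd@localhost:5984/_config/couchdb/os_process_timeout
# => "60000"
{code}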

However, there *is* some other issue here: before the OOM killer goes awol, CouchDB
starts consuming a lot of memory. My instance cruises along under 300MB resident RAM for
most of the replication, then shoots up very rapidly past 3GB, and then it's
all over red rover. I'm working on tracking down exactly what this is, but it's somewhat
tricky given the amount of concurrent stuff happening.
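
If you want to watch this yourself, a crude sampling loop is enough to see the
ramp (a sketch; adjust the interval to taste):

{code}
# sample beam.smp resident memory (kB) every 10 seconds
while true; do
    date '+%T'
    ps -o rss= -p `pgrep beam.smp`
    sleep 10
done
{code}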

{code}
  Node: npm@z1 (Connected) (R15B01/5.9.1) unix (linux 3.2.0) CPU:4 SMP +A:16 +K
Time: local time 15:04:41, up for 000:00:33:51, 0ms latency,
Processes: total 504 (RQ 42) at 53424637 RpI using 63038.5k (63084.9k allocated)
Memory: Sys 2490.3m, Atom 264.9k/266.3k, Bin 2480.2m, Code 6292.2k, Ets 885.3k

Interval 1000ms, Sorting on "HTot" (Descending), Retrieved in 15ms
         Pid Registered Name      Reductions   MQueue HSize  SSize  HTot
  <0.4203.0> -                    459031       0      514229 9      832040
  <0.5277.0> -                    31099        0      121393 19     439204
     <0.6.0> error_logger         17874687     0      46368  8      364179
  <0.5840.0> -                    367940       1      28657  189    225075
  <0.6295.0> -


Node: npm@z1 (Connected) (R15B01/5.9.1) unix (linux 3.2.0) CPU:4 SMP +A:16 +K
Time: local time 15:04:22, up for 000:00:33:32, 0ms latency,
Processes: total 504 (RQ 46) at 53654512 RpI using 62889.0k (62899.1k allocated)
Memory: Sys 2075.6m, Atom 264.9k/266.3k, Bin 2065.5m, Code 6292.2k, Ets 880.0k

Interval 1000ms, Sorting on "SSize" (Descending), Retrieved in 17ms
         Pid Registered Name      Reductions   MQueue HSize  SSize  HTot
  <0.6295.0> -                    351722       0      28657  204    225075
  <0.5840.0> -                    340874       0      28657  158    225075
  <0.5063.0> -                    302674       1      17711  118    139104
  <0.4434.0> -                    69314        0      10946  86     85971
  <0.5062.0> -
  
  
Node: npm@z1 (Connected) (R15B01/5.9.1) unix (linux 3.2.0) CPU:4 SMP +A:16 +K
Time: local time 15:06:02, up for 000:00:35:12, 0ms latency,
Processes: total 496 (RQ 42) at 54048301 RpI using 58790.1k (58810.5k allocated)
Memory: Sys 2904.0m, Atom 264.9k/266.3k, Bin 2893.9m, Code 6292.2k, Ets 908.1k

Interval 1000ms, Sorting on "HSize" (Descending), Retrieved in 18ms
         Pid Registered Name      Reductions   MQueue HSize  SSize  HTot
  <0.4203.0> -                    459031       0      514229 9      832040
  <0.5002.0> -                    31564        0      196418 19     196418
  <0.5663.0> -                    36248        0      196418 19     196418
    <0.12.0> rex                  1330480      0      121393 9      121770
  <0.3919.0> couch_stats_a
{code}

For the last 15 minutes before the crash, the message queue for one process appears stuck
at 10 (`MQueue 10` for `<0.4203.0>`), which could be normal or not, but I do think it's weird.
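
If anyone wants to poke at that process while it's wedged, attaching a remote
shell and asking for process_info is one way. A sketch: the node name comes
from the etop output above, and the cookie is whatever your vm.args sets --
'monster' below is just a placeholder.

{code}
# attach a remote shell to the running node (cookie is a placeholder)
erl -name dbg@127.0.0.1 -remsh npm@z1 -setcookie monster

# then, at the remote shell prompt:
#   erlang:process_info(pid(0,4203,0),
#                       [message_queue_len, current_function, memory]).
{code}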

At the moment npm itself is not letting me replicate, so I can't debug further until it's available again.


> Trying to replicate NPM grinds to a halt after 40GB
> ---------------------------------------------------
>
>                 Key: COUCHDB-1946
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1946
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>            Reporter: Marc Trudel
>         Attachments: couch.log
>
>
> I have been able to replicate the Node.js NPM database until 40G or so, then I get this:
> https://gist.github.com/stelcheck/7723362
> In one case I got a flat-out OOM error, but I didn't take a dump of the log output at the time.
> CentOS 6.4 with CouchDB 1.5 (also tried 1.3.1, but to no avail). Also tried to restart
> replication from scratch - twice - both cases stalling at 40GB.



