cassandra-commits mailing list archives

From "Jeff Griffith (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
Date Fri, 23 Oct 2015 17:35:28 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971417#comment-14971417 ]

Jeff Griffith edited comment on CASSANDRA-10515 at 10/23/15 5:34 PM:
---------------------------------------------------------------------

[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G setting, but with different behavior. I have captured the whole thing with thread dumps and tpstats every two minutes, and I've embedded the pending numbers in the filenames for your convenience so it's easy to see where the backup starts. *-node1.tar.gz is the only one I uploaded since the files were so large, but note in the Dashboard.jpg file that all three nodes break the limit at about the same time. I can upload the others if that is useful. This case seems different from the previous one, where lots of L0 files were causing thread blocking, but even here it looks like MemtablePostFlush is stuck on a CountDownLatch.

https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz
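
For anyone digging through that tarball, here is a minimal sketch of how the configured cap and the stuck post-flush thread might be checked. The cassandra.yaml path, the mapping of the "12G" cap to commitlog_total_space_in_mb, and the exact stack frames shown are illustrative assumptions, not taken from the attachment.

{code}
# Assumed cassandra.yaml location; the "12G" cap presumably corresponds to this setting.
grep commitlog_total_space_in_mb /etc/cassandra/conf/cassandra.yaml
# expected (assumption): commitlog_total_space_in_mb: 12288

# In a thread dump, a MemtablePostFlush thread blocked on a CountDownLatch typically
# parks with frames along these lines (illustrative, abbreviated):
jstack 37996 | grep -A 12 MemtablePostFlush
#   java.lang.Thread.State: WAITING (parking)
#       at sun.misc.Unsafe.park(Native Method)
#       at java.util.concurrent.locks.LockSupport.park(...)
#       at java.util.concurrent.CountDownLatch.await(...)
{code}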

This happened twice during this period; here is the first one. Note that the pid changed because our monitoring detected the problem and restarted the node.

{code}
tpstats_20151023-00:16:02_pid_37996_postpend_0.txt
tpstats_20151023-00:18:08_pid_37996_postpend_1.txt
tpstats_20151023-00:20:14_pid_37996_postpend_0.txt
tpstats_20151023-00:22:19_pid_37996_postpend_3.txt
tpstats_20151023-00:24:25_pid_37996_postpend_133.txt
tpstats_20151023-00:26:30_pid_37996_postpend_809.txt
tpstats_20151023-00:28:35_pid_37996_postpend_1596.txt
tpstats_20151023-00:30:39_pid_37996_postpend_2258.txt
tpstats_20151023-00:32:42_pid_37996_postpend_3095.txt
tpstats_20151023-00:34:45_pid_37996_postpend_3822.txt
tpstats_20151023-00:36:48_pid_37996_postpend_4593.txt
tpstats_20151023-00:38:52_pid_37996_postpend_5363.txt
tpstats_20151023-00:40:55_pid_37996_postpend_6212.txt
tpstats_20151023-00:42:59_pid_37996_postpend_7137.txt
tpstats_20151023-00:45:03_pid_37996_postpend_8559.txt
tpstats_20151023-00:47:06_pid_37996_postpend_9060.txt
tpstats_20151023-00:49:09_pid_37996_postpend_9060.txt
tpstats_20151023-00:51:11_pid_48196_postpend_0.txt
tpstats_20151023-00:53:13_pid_48196_postpend_0.txt
tpstats_20151023-00:55:16_pid_48196_postpend_0.txt
tpstats_20151023-00:57:21_pid_48196_postpend_0.txt

{code}
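
(The capture script itself is not in the attachment; a hypothetical reconstruction of the two-minute loop, assuming "postpend" is the MemtablePostFlush pending count from tpstats and that the filename timestamps are UTC, would look roughly like this.)

{code}
# Hypothetical reconstruction of the capture loop, not the actual script.
while true; do
  pid=$(pgrep -f CassandraDaemon | head -1)
  stamp=$(date -u +%Y%m%d-%H:%M:%S)
  out=$(nodetool tpstats)
  # Pending is the third column of tpstats output; assumed to be the "postpend" value.
  pending=$(echo "$out" | awk '/MemtablePostFlush/ {print $3}')
  echo "$out" > "tpstats_${stamp}_pid_${pid}_postpend_${pending}.txt"
  jstack "$pid" > "threaddump_${stamp}_pid_${pid}.txt"
  sleep 120
done
{code}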



was (Author: jeffery.griffith):
[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G setting, but with different behavior. I have captured the whole thing with thread dumps and tpstats every two minutes, and I've embedded the pending numbers in the filenames for your convenience so it's easy to see where the backup starts. *-node1.tar.gz is the only one I uploaded since the files were so large, but note in the Dashboard.jpg file that all three nodes break the limit at about the same time. I can upload the others if that is useful. This case seems different from the previous one, where lots of L0 files were causing thread blocking, but even here it looks like MemtablePostFlush is stuck on a CountDownLatch.

https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz

This happened twice during this period; here is the first one. Note that the pid changed because our monitoring detected the problem and restarted the node.

{code}
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:16 tpstats_20151023-00:16:02_pid_37996_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:18 tpstats_20151023-00:18:08_pid_37996_postpend_1.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:20 tpstats_20151023-00:20:14_pid_37996_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:22 tpstats_20151023-00:22:19_pid_37996_postpend_3.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:24 tpstats_20151023-00:24:25_pid_37996_postpend_133.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:26 tpstats_20151023-00:26:30_pid_37996_postpend_809.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:28 tpstats_20151023-00:28:35_pid_37996_postpend_1596.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:30 tpstats_20151023-00:30:39_pid_37996_postpend_2258.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:32 tpstats_20151023-00:32:42_pid_37996_postpend_3095.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:34 tpstats_20151023-00:34:45_pid_37996_postpend_3822.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:36 tpstats_20151023-00:36:48_pid_37996_postpend_4593.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:38 tpstats_20151023-00:38:52_pid_37996_postpend_5363.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:40 tpstats_20151023-00:40:55_pid_37996_postpend_6212.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:43 tpstats_20151023-00:42:59_pid_37996_postpend_7137.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:45 tpstats_20151023-00:45:03_pid_37996_postpend_8559.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:47 tpstats_20151023-00:47:06_pid_37996_postpend_9060.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:49 tpstats_20151023-00:49:09_pid_37996_postpend_9060.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:51 tpstats_20151023-00:51:11_pid_48196_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:53 tpstats_20151023-00:53:13_pid_48196_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:55 tpstats_20151023-00:55:16_pid_48196_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:57 tpstats_20151023-00:57:21_pid_48196_postpend_0.txt

{code}


> Commit logs back up with move to 2.1.10
> ---------------------------------------
>
>                 Key: CASSANDRA-10515
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: redhat 6.5, cassandra 2.1.10
>            Reporter: Jeff Griffith
>            Assignee: Branimir Lambov
>            Priority: Critical
>              Labels: commitlog, triage
>         Attachments: C5commitLogIncrease.jpg, CommitLogProblem.jpg, CommitLogSize.jpg, MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, cassandra.yaml, cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from Cassandra 2.0.x to 2.1.10, we began seeing problems where some nodes break the 12G commit log max we configured and grow as high as 65G or more before restarting. Once the commit log files exceed 12G, "nodetool compactionstats" hangs. Eventually C* restarts without errors (not sure yet whether it is crashing, but I'm checking into it), the cleanup occurs, and the commit logs shrink back down again. Here is the nodetool compactionstats output immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>    compaction type   keyspace                          table     completed          total    unit   progress
>         Compaction   SyncCore                          *cf1*   61251208033   170643574558   bytes     35.89%
>         Compaction   SyncCore                          *cf2*   19262483904    19266079916   bytes     99.98%
>         Compaction   SyncCore                          *cf3*    6592197093     6592316682   bytes    100.00%
>         Compaction   SyncCore                          *cf4*    3411039555     3411039557   bytes    100.00%
>         Compaction   SyncCore                          *cf5*    2879241009     2879487621   bytes     99.99%
>         Compaction   SyncCore                          *cf6*   21252493623    21252635196   bytes    100.00%
>         Compaction   SyncCore                          *cf7*   81009853587    81009854438   bytes    100.00%
>         Compaction   SyncCore                          *cf8*    3005734580     3005768582   bytes    100.00%
> Active compaction remaining time :        n/a
> {code}
> I was also running "nodetool tpstats" periodically; it kept working, but its output was not logged in system.log by the StatusLogger thread until after compaction started working again.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
