Date: Sat, 17 Sep 2016 15:39:20 +0000 (UTC)
From: "Wei Deng (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-12655) Incremental repair & compaction hang on random nodes

    [ https://issues.apache.org/jira/browse/CASSANDRA-12655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15499221#comment-15499221 ]

Wei Deng commented on CASSANDRA-12655:
--------------------------------------

This is a duplicate of CASSANDRA-12200, which is currently being worked on by [~jjirsa]. It's not a bug per se, but the behavior does need to be improved so that backlogged compactions (in your case, you can see 66 pending tasks on CompactionExecutor) don't block anti-compaction for so long that an otherwise trivial repair cannot finish.
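Until that lands, one rough way to confirm you are hitting this situation is to watch whether the CompactionExecutor backlog from {code}nodetool tpstats{code} drains before starting the repair. The sketch below is only an illustration, not part of any fix: the threshold, the 60-second poll interval, and the keyspace name {{my_keyspace}} are placeholders you would adapt to your cluster.

{code}
#!/usr/bin/env bash
# Sketch: wait for the CompactionExecutor backlog to drain before kicking off repair.
# MAX_PENDING, the poll interval, and my_keyspace are placeholders, not recommendations.
MAX_PENDING=5

while true; do
  # tpstats prints one line per thread pool: name, active, pending, completed, ...
  pending=$(nodetool tpstats | awk '$1 == "CompactionExecutor" {print $3}')
  pending=${pending:-999}   # treat a failed read as "still backed up"
  echo "$(date '+%F %T') CompactionExecutor pending: ${pending}"
  if [ "${pending}" -le "${MAX_PENDING}" ]; then
    break
  fi
  sleep 60
done

# Backlog has drained; anti-compaction should no longer sit behind regular compactions.
nodetool repair my_keyspace
{code}

This only works around the scheduling behavior described above; the actual improvement is being tracked in CASSANDRA-12200.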
> Incremental repair & compaction hang on random nodes
> -----------------------------------------------------
>
>                 Key: CASSANDRA-12655
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12655
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction
>        Environment: CentOS Linux release 7.1.1503 (Core)
>                     RAM - 64GB
>                     HEAP - 16GB
>                     Load on each node - ~5GB
>                     Cassandra Version - 2.2.5
>            Reporter: Navjyot Nishant
>            Priority: Blocker
>
> Hi, we are setting up incremental repair on our 18-node cluster. The average load on each node is ~5GB. The repair runs fine on a couple of nodes and then suddenly gets stuck on random nodes. Upon checking the system.log of an impacted node, we don't see much information.
> Following are the lines we see in system.log; they have been there since the point the repair stopped making progress:
> {code}
> INFO  [CompactionExecutor:3490] 2016-09-16 11:14:44,236 CompactionManager.java:1221 - Anticompacting [BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30832-big-Data.db'), BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30811-big-Data.db')]
> INFO  [IndexSummaryManager:1] 2016-09-16 11:14:49,954 IndexSummaryRedistribution.java:74 - Redistributing index summaries
> INFO  [IndexSummaryManager:1] 2016-09-16 12:14:49,961 IndexSummaryRedistribution.java:74 - Redistributing index summaries
> {code}
> When we try to see pending compactions with {code}nodetool compactionstats{code}, it hangs as well and doesn't return anything. However, {code}nodetool tpstats{code} shows active and pending compactions which never come down and keep increasing:
> {code}
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> MutationStage                     0         0         221208         0                 0
> ReadStage                         0         0        1288839         0                 0
> RequestResponseStage              0         0         104356         0                 0
> ReadRepairStage                   0         0             72         0                 0
> CounterMutationStage              0         0              0         0                 0
> HintedHandoff                     0         0             46         0                 0
> MiscStage                         0         0              0         0                 0
> CompactionExecutor                8        66          68124         0                 0
> MemtableReclaimMemory             0         0            166         0                 0
> PendingRangeCalculator            0         0             38         0                 0
> GossipStage                       0         0         242455         0                 0
> MigrationStage                    0         0              0         0                 0
> MemtablePostFlush                 0         0           3682         0                 0
> ValidationExecutor                0         0           2246         0                 0
> Sampler                           0         0              0         0                 0
> MemtableFlushWriter               0         0            166         0                 0
> InternalResponseStage             0         0           8866         0                 0
> AntiEntropyStage                  0         0          15417         0                 0
> Repair#7                          0         0            160         0                 0
> CacheCleanupExecutor              0         0              0         0                 0
> Native-Transport-Requests         0         0         327334         0                 0
>
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
> {code}nodetool netstats{code} shows some pending messages which never get processed, and nothing in progress:
> {code}
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 15585
> Mismatch (Blocking): 0
> Mismatch (Background): 0
> Pool Name                    Active   Pending      Completed
> Large messages                  n/a        12            562
> Small messages                  n/a         0         999779
> Gossip messages                 n/a         0         264394
> {code}
> The only solution we have found is to bounce the node, after which all the pending compactions start getting processed immediately and finish within 5-10 minutes.
> This is a road-blocker issue for us, and any help in this matter would be highly appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)