Date: Sat, 17 Sep 2016 15:39:20 +0000 (UTC)
From: "Wei Deng (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-12655) Incremental repair & compaction hang on random nodes

    [ https://issues.apache.org/jira/browse/CASSANDRA-12655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15499221#comment-15499221 ]

Wei Deng commented on CASSANDRA-12655:
--------------------------------------

This is a duplicate of CASSANDRA-12200, which is currently being worked on by [~jjirsa]. It's not a bug per se, but the behavior does need to be improved so that backlogged compactions (in your case, you can see 66 pending tasks on CompactionExecutor) don't block anti-compaction for so long that an otherwise trivial repair cannot finish.
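Until that lands, one rough way to confirm you are hitting this situation is to watch whether the CompactionExecutor backlog from {code}nodetool tpstats{code} drains before starting the repair. The sketch below is only an illustration, not part of any fix: the threshold, the 60-second poll interval, and the keyspace name {{my_keyspace}} are placeholders you would adapt to your cluster.

{code}
#!/usr/bin/env bash
# Sketch: wait for the CompactionExecutor backlog to drain before kicking off repair.
# MAX_PENDING, the poll interval, and my_keyspace are placeholders, not recommendations.
MAX_PENDING=5

while true; do
  # tpstats prints one line per thread pool: name, active, pending, completed, ...
  pending=$(nodetool tpstats | awk '$1 == "CompactionExecutor" {print $3}')
  pending=${pending:-999}   # treat a failed read as "still backed up"
  echo "$(date '+%F %T') CompactionExecutor pending: ${pending}"
  if [ "${pending}" -le "${MAX_PENDING}" ]; then
    break
  fi
  sleep 60
done

# Backlog has drained; anti-compaction should no longer sit behind regular compactions.
nodetool repair my_keyspace
{code}

This only works around the scheduling behavior described above; the actual improvement is being tracked in CASSANDRA-12200.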
> Incremental repair & compaction hang on random nodes
> -----------------------------------------------------
>
>                 Key: CASSANDRA-12655
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12655
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction
>        Environment: CentOS Linux release 7.1.1503 (Core)
>                     RAM - 64GB
>                     HEAP - 16GB
>                     Load on each node - ~5GB
>                     Cassandra Version - 2.2.5
>            Reporter: Navjyot Nishant
>            Priority: Blocker
>
> Hi, we are setting up incremental repair on our 18-node cluster. The average load on each node is ~5GB. The repair runs fine on a couple of nodes and then suddenly gets stuck on random nodes. Upon checking the system.log of an impacted node, we don't see much information.
> Following are the lines we see in system.log; they have been there since the point the repair stopped making progress:
> {code}
> INFO  [CompactionExecutor:3490] 2016-09-16 11:14:44,236 CompactionManager.java:1221 - Anticompacting [BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30832-big-Data.db'), BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30811-big-Data.db')]
> INFO  [IndexSummaryManager:1] 2016-09-16 11:14:49,954 IndexSummaryRedistribution.java:74 - Redistributing index summaries
> INFO  [IndexSummaryManager:1] 2016-09-16 12:14:49,961 IndexSummaryRedistribution.java:74 - Redistributing index summaries
> {code}
> When we try to see pending compactions with {code}nodetool compactionstats{code}, it hangs as well and doesn't return anything. However, {code}nodetool tpstats{code} shows active and pending compactions which never come down and keep increasing:
> {code}
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> MutationStage                     0         0         221208         0                 0
> ReadStage                         0         0        1288839         0                 0
> RequestResponseStage              0         0         104356         0                 0
> ReadRepairStage                   0         0             72         0                 0
> CounterMutationStage              0         0              0         0                 0
> HintedHandoff                     0         0             46         0                 0
> MiscStage                         0         0              0         0                 0
> CompactionExecutor                8        66          68124         0                 0
> MemtableReclaimMemory             0         0            166         0                 0
> PendingRangeCalculator            0         0             38         0                 0
> GossipStage                       0         0         242455         0                 0
> MigrationStage                    0         0              0         0                 0
> MemtablePostFlush                 0         0           3682         0                 0
> ValidationExecutor                0         0           2246         0                 0
> Sampler                           0         0              0         0                 0
> MemtableFlushWriter               0         0            166         0                 0
> InternalResponseStage             0         0           8866         0                 0
> AntiEntropyStage                  0         0          15417         0                 0
> Repair#7                          0         0            160         0                 0
> CacheCleanupExecutor              0         0              0         0                 0
> Native-Transport-Requests         0         0         327334         0                 0
>
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
> {code}nodetool netstats{code} shows some pending messages which never get processed, and nothing in progress:
> {code}
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 15585
> Mismatch (Blocking): 0
> Mismatch (Background): 0
> Pool Name                    Active   Pending      Completed
> Large messages                  n/a        12            562
> Small messages                  n/a         0         999779
> Gossip messages                 n/a         0         264394
> {code}
> The only solution we have found is to bounce the node, after which all the pending compactions start getting processed immediately and finish within 5-10 minutes.
> This is a road-blocker issue for us, and any help in this matter would be highly appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)