cassandra-commits mailing list archives

From "Wei Deng (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12655) Incremental repair & compaction hang on random nodes
Date Sat, 17 Sep 2016 15:52:20 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15499243#comment-15499243 ]

Wei Deng commented on CASSANDRA-12655:
--------------------------------------

BTW, in addition to anti-compaction getting blocked by other regular compactions, you may have also run into a separate hang affecting the regular compactions themselves (especially since you say "nodetool compactionstats" also hangs forever), as Marcus pointed out. The next time you hit this, if the CPU is completely idle with no core working on any regular compaction thread while anti-compaction is still blocked behind all the pending regular compactions, then you've most likely hit one of the compaction hang bugs in the earlier 2.2.x versions.
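
To double-check that, a rough way to look at the compaction threads directly (sketch only, assuming a typical Linux install where the Cassandra JVM can be found via pgrep and jstack is available from the JDK):

{code}
# find the Cassandra JVM pid (assumes the standard CassandraDaemon main class)
CASS_PID=$(pgrep -f CassandraDaemon)

# per-thread CPU view: busy compaction threads should show up near the top
top -H -p "$CASS_PID"

# thread dump: see what the CompactionExecutor threads are actually doing / waiting on
jstack "$CASS_PID" | grep -A 15 'CompactionExecutor'
{code}

If the CompactionExecutor threads are all parked or waiting on a lock and nothing is burning CPU, that points to the hang rather than a genuinely slow compaction.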

Moving to the latest 2.2.x version will definitely help you avoid the known and already-fixed compaction hang problems, and that should be a required first step. Beyond that, you will need to wait for the improvement in CASSANDRA-12200 to completely prevent a trivial repair from being blocked by backlogged compactions. Since CASSANDRA-12200 is an improvement rather than a bug fix and likely will not go into 2.2, you should plan to cherry-pick the change and back-port it to your own 2.2 build if you don't plan to move to 3.x soon.
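
The back-port itself would just be the usual cherry-pick workflow against your internal branch, roughly like the sketch below (the branch name and commit hash are placeholders; use whatever CASSANDRA-12200 eventually gets committed as):

{code}
# start from your internal 2.2-based branch (placeholder name)
git checkout my-internal-2.2
git fetch https://github.com/apache/cassandra.git trunk

# pull in the CASSANDRA-12200 change once it lands (hash is a placeholder)
git cherry-pick <commit-hash-of-CASSANDRA-12200>

# resolve any conflicts, then rebuild the jars
ant artifacts
{code}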

> Incremental repair & compaction hang on random nodes
> ----------------------------------------------------
>
>                 Key: CASSANDRA-12655
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12655
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction
>         Environment: CentOS Linux release 7.1.1503 (Core)
> RAM - 64GB
> HEAP - 16GB
> Load on each node - ~5GB
> Cassandra Version - 2.2.5
>            Reporter: Navjyot Nishant
>            Priority: Blocker
>
> Hi. We are setting up incremental repair on our 18-node cluster. The average load on each node is ~5GB. The repair runs fine on a couple of nodes and then suddenly gets stuck on random nodes. Upon checking the system.log of an impacted node we don't see much information.
> Following are the lines we see in system.log; they have been there from the point the repair stopped making progress:
> {code}
> INFO  [CompactionExecutor:3490] 2016-09-16 11:14:44,236 CompactionManager.java:1221 - Anticompacting [BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30832-big-Data.db'), BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30811-big-Data.db')]
> INFO  [IndexSummaryManager:1] 2016-09-16 11:14:49,954 IndexSummaryRedistribution.java:74 - Redistributing index summaries
> INFO  [IndexSummaryManager:1] 2016-09-16 12:14:49,961 IndexSummaryRedistribution.java:74 - Redistributing index summaries
> {code}
> When we try to see pending compactions by executing {code}nodetool compactionstats{code}, it hangs as well and doesn't return anything. However, {code}nodetool tpstats{code} shows active and pending compactions which never come down and keep increasing (a simple way to watch that is sketched after the output below).
> {code}
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> MutationStage                     0         0         221208         0                 0
> ReadStage                         0         0        1288839         0                 0
> RequestResponseStage              0         0         104356         0                 0
> ReadRepairStage                   0         0             72         0                 0
> CounterMutationStage              0         0              0         0                 0
> HintedHandoff                     0         0             46         0                 0
> MiscStage                         0         0              0         0                 0
> CompactionExecutor                8        66          68124         0                 0
> MemtableReclaimMemory             0         0            166         0                 0
> PendingRangeCalculator            0         0             38         0                 0
> GossipStage                       0         0         242455         0                 0
> MigrationStage                    0         0              0         0                 0
> MemtablePostFlush                 0         0           3682         0                 0
> ValidationExecutor                0         0           2246         0                 0
> Sampler                           0         0              0         0                 0
> MemtableFlushWriter               0         0            166         0                 0
> InternalResponseStage             0         0           8866         0                 0
> AntiEntropyStage                  0         0          15417         0                 0
> Repair#7                          0         0            160         0                 0
> CacheCleanupExecutor              0         0              0         0                 0
> Native-Transport-Requests         0         0         327334         0                 0
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
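> For reference, a simple way to keep watching that CompactionExecutor backlog while compactionstats is unusable (sketch only, assuming nodetool is on the PATH):
> {code}
> # sample the CompactionExecutor line from tpstats every 30 seconds
> while true; do
>   date
>   nodetool tpstats | grep -E 'Pool Name|CompactionExecutor'
>   sleep 30
> done
> {code}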
> {code}nodetool netstats{code} shows some pending messages which never get processed, and nothing in progress:
> {code}
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 15585
> Mismatch (Blocking): 0
> Mismatch (Background): 0
> Pool Name                    Active   Pending      Completed
> Large messages                  n/a        12            562
> Small messages                  n/a         0         999779
> Gossip messages                 n/a         0         264394
> {code}
> The only solution we have is to bounce the node, after which all the pending compactions start getting processed immediately and finish within 5-10 minutes.
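> For completeness, a typical way to bounce a node cleanly (sketch only; the service name is an assumption and depends on how Cassandra is installed on this CentOS 7 host):
> {code}
> # flush memtables and stop accepting new writes before restarting
> nodetool drain
> # restart the service (name assumed)
> sudo systemctl restart cassandra
> # confirm the node is back and compactions are moving again
> nodetool status
> nodetool tpstats | grep CompactionExecutor
> {code}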
> This is a road-blocker issue for us and any help in this matter would be highly appreciated.



