cassandra-commits mailing list archives

From "DOAN DuyHai (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11383) Avoid index segment stitching in RAM which lead to OOM on big SSTable files
Date Sat, 26 Mar 2016 20:29:25 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213190#comment-15213190 ]

DOAN DuyHai commented on CASSANDRA-11383:
-----------------------------------------

Ok [~xedin], I'll retest *test case 2* with your fix (the branch is [here|https://github.com/xedin/cassandra/tree/CASSANDRA-11383], right?).

 This bug can leave a cluster in a permanently down state. Upon node reboot the index build kicks
in, hits the exception again, and kills the gossip stage. Dropping the index
is not possible while some nodes are marked {{DOWN}} (schema agreement fails). The only
work-around I've found to recover my cluster was the following (a rough shell sketch follows the list):

1. reboot the {{DOWN}} node
2. execute {{nodetool status}} quickly and repeatedly
3. as soon as {{nodetool status}} replies, quickly issue {{nodetool stop INDEX_BUILD}}
before the index build can kick in
4. repeat steps 1 to 3 on all nodes marked {{DOWN}}
5. wait for gossip to recover
6. use cqlsh to drop the index
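
A minimal shell sketch of steps 2-3 for a single node; the address and the polling interval are placeholders, not from an actual run:

{code}
#!/bin/bash
# Recovery sketch for steps 2-3; NODE_IP is a placeholder.
NODE_IP="10.0.0.1"

# Step 2: poll nodetool status until the rebooted node starts replying.
until nodetool -h "$NODE_IP" status > /dev/null 2>&1; do
  sleep 1
done

# Step 3: stop the index build before it can kill the gossip stage again.
nodetool -h "$NODE_IP" stop INDEX_BUILD
{code}

Once gossip has recovered on every node (step 5), the index can finally be dropped from cqlsh as usual (step 6).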

> Avoid index segment stitching in RAM which lead to OOM on big SSTable files 
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11383
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11383
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CQL
>         Environment: C* 3.4
>            Reporter: DOAN DuyHai
>            Assignee: Jordan West
>              Labels: sasi
>             Fix For: 3.5
>
>         Attachments: CASSANDRA-11383.patch, SASI_Index_build_LCS_1G_Max_SSTable_Size_logs.tar.gz,
new_system_log_CMS_8GB_OOM.log, system.log_sasi_build_oom
>
>
> 13 bare-metal machines:
> - 6-core CPU (12 HT)
> - 64 GB RAM
> - 4 SSDs in RAID0
> JVM settings:
> - G1 GC
> - Xms32G, Xmx32G
> Data set:
> - ≈ 100 GB per node
> - 1.3 TB cluster-wide
> - ≈ 20 GB for all SASI indices
> C* settings:
> - concurrent_compactors: 1
> - compaction_throughput_mb_per_sec: 256
> - memtable_heap_space_in_mb: 2048
> - memtable_offheap_space_in_mb: 2048
> I created 9 SASI indices:
> - 8 indices on text fields: NonTokenizingAnalyzer, PREFIX mode, case-insensitive
> - 1 index on a numeric field: SPARSE mode
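> For illustration, the index definitions were of roughly this shape; the keyspace, table, column, and node names below are placeholders, not the real schema:
> {code}
> # One of the 8 text indices (PREFIX, non-tokenizing, case-insensitive)
> # plus the SPARSE numeric index; <node_ip> is a placeholder.
> cqlsh <node_ip> <<'CQL'
> CREATE CUSTOM INDEX text_col_idx ON ks.tbl (text_col)
> USING 'org.apache.cassandra.index.sasi.SASIIndex'
> WITH OPTIONS = {
>   'mode': 'PREFIX',
>   'analyzed': 'true',
>   'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
>   'case_sensitive': 'false'
> };
>
> -- The SPARSE index on the numeric field
> CREATE CUSTOM INDEX num_col_idx ON ks.tbl (num_col)
> USING 'org.apache.cassandra.index.sasi.SASIIndex'
> WITH OPTIONS = { 'mode': 'SPARSE' };
> CQL
> {code}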
> After a while, the nodes just went OOM.
> I attached log files. You can see a lot of GC happening while index segments are flushed
to disk. At some point the nodes OOM ...
> /cc [~xedin]



