cassandra-commits mailing list archives

From "Michael Fong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11748) Schema version mismatch may leads to Casandra OOM at bootstrap during a rolling upgrade process
Date Wed, 11 May 2016 03:33:12 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279442#comment-15279442
] 

Michael Fong commented on CASSANDRA-11748:
------------------------------------------

Why the schema version changes after a restart is still unknown. However, the situation where
nodes hold different schema versions and the resulting exchange floods the Cassandra heap is
fairly easy to reproduce.

All we did was:
1. Block gossip communication between the nodes of a 2-node cluster via iptables.
2. Keep updating the schema on one node, so that its schema version diverges from the other node's.
3. Remove the firewall rules.
4. A message storm of schema information exchange follows, and Cassandra may run into OOM
if it has a small heap.
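
Steps 1 and 3 could be scripted along the following lines. This is a minimal sketch, not necessarily the commands used in the test above: it assumes inter-node traffic uses the default storage_port of 7000 (check cassandra.yaml), and the function names are hypothetical.

```python
import subprocess

STORAGE_PORT = 7000  # assumption: default storage_port from cassandra.yaml


def gossip_rule(action, peer_ip, port=STORAGE_PORT):
    """Build an iptables command that appends ("-A") or deletes ("-D") a DROP
    rule for inbound inter-node TCP traffic coming from peer_ip."""
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' (block) or '-D' (unblock)")
    return ["iptables", action, "INPUT", "-p", "tcp", "-s", peer_ip,
            "--dport", str(port), "-j", "DROP"]


def block_gossip(peer_ip, run=subprocess.check_call):
    """Step 1: drop inbound inter-node traffic from peer_ip (needs root)."""
    run(gossip_rule("-A", peer_ip))


def unblock_gossip(peer_ip, run=subprocess.check_call):
    """Step 3: delete the rule added by block_gossip (needs root)."""
    run(gossip_rule("-D", peer_ip))
```

The rule only drops inbound connections, so it would have to be applied on both nodes, each with the other node's address, to cut traffic in both directions.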

P.S. The effect seems to scale with the number of schema changes: the more changes, the larger
the message exchange.

> Schema version mismatch may leads to Casandra OOM at bootstrap during a rolling upgrade
process
> -----------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11748
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
>            Reporter: Michael Fong
>
> We have observed multiple times that a multi-node C* (v2.0.17) cluster ran into OOM at
> bootstrap during a rolling upgrade from 1.2.19 to 2.0.17.
> Here is an outline of our rolling upgrade process:
> 1. Update the schema on a node, and wait until all nodes are in schema version agreement
> (checked via nodetool describecluster).
> 2. Restart a Cassandra node.
> 3. After the restart, there is a chance that the restarted node has a different schema
> version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and any node
> could run into OOM.
> The following system.log excerpts are from one of our 2-node test beds:
> ----------------------------------
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2:
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> Node 2 then submits a migration task to the other node more than 100 times:
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> ----------------------------------
> On the other hand, Node 1 keeps updating its gossip information, then receives migration
> requests and submits migration tasks:
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> .....  (over 50+ times)
> As a side note, we have more than 200 column families defined in the Cassandra database,
> which may be related to this amount of RPC traffic.
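
To put numbers on the "over 100+ times" observations, the migration messages in a system.log can be tallied per peer. The following is a minimal sketch that assumes only the message texts shown in the excerpts above ("Submitting migration task for /<ip>" and "Received migration request from /<ip>."); the function name is hypothetical.

```python
import re
from collections import Counter

# Message texts taken from the log excerpts above. Matching is case-insensitive
# because both "Submitting" and "submitting" appear in the excerpts.
SUBMIT_RE = re.compile(r"submitting\s+migration\s+task\s+for\s+/(\S+)", re.IGNORECASE)
RECEIVE_RE = re.compile(r"received\s+migration\s+request\s+from\s+/(\S+)", re.IGNORECASE)


def count_migration_events(lines):
    """Return (submitted, received): Counters keyed by peer address."""
    submitted, received = Counter(), Counter()
    for line in lines:
        match = SUBMIT_RE.search(line)
        if match:
            submitted[match.group(1).rstrip(".")] += 1
            continue
        match = RECEIVE_RE.search(line)
        if match:
            # The "Received ..." message ends with a period; strip it.
            received[match.group(1).rstrip(".")] += 1
    return submitted, received
```

Passing an open log file, for example `count_migration_events(open("system.log"))` with the path adjusted to the node's log directory, gives one counter of submitted tasks and one of received requests, which can be compared across nodes and across runs with different numbers of schema changes.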



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
