cassandra-commits mailing list archives

From "Michael Fong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process
Date Thu, 24 Aug 2017 04:08:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139527#comment-16139527 ]

Michael Fong commented on CASSANDRA-11748:
------------------------------------------

Hi, [~mbyrd]

Thanks for looking further into this problem. And yes, the above-mentioned items were the
symptoms we observed (and suspected) when the OOM issue happened. I fully agree with your
analysis of this particular problem - thanks for answering the questions thoroughly! If we
can locate the logs from back then, we will check them and share them on this ticket.

Thanks again for working on this issue!

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process
> -----------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11748
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes in deployments of different scales (2G ~ 5G)
>            Reporter: Michael Fong
>            Assignee: Matt Byrd
>            Priority: Critical
>             Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times that a multi-node C* (v2.0.17) cluster ran into OOM during bootstrap in a rolling upgrade process from 1.2.19 to 2.0.17.
> Here is a simple outline of our rolling upgrade process:
> 1. Update the schema on a node, and wait until all nodes are in schema version agreement - via nodetool describecluster.
> 2. Restart a Cassandra node.
> 3. After the restart, there is a chance that the restarted node has a different schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and any node could run into OOM (a sketch of this exchange pattern follows this list).
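> For context, the pattern at play is roughly this: whenever gossip reports a peer whose schema version differs from the local one, a migration (schema pull) task is submitted, with nothing deduplicating or rate-limiting the pulls already in flight. Below is a minimal, self-contained Java sketch of that pattern - it is illustrative only, not Cassandra's actual MigrationManager code, and the class and method names are hypothetical.
> ----------------------------------
> import java.util.UUID;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> // Illustrative sketch (NOT actual Cassandra code) of the gossip-driven
> // schema-pull pattern described above.
> public class SchemaPullSketch
> {
>     // Single-threaded stage, analogous to the MigrationStage in the logs below.
>     private final ExecutorService migrationStage = Executors.newSingleThreadExecutor();
>
>     private volatile UUID localSchemaVersion = UUID.randomUUID();
>
>     // Called on every gossip state change of a peer (hypothetical hook).
>     public void onGossipChange(String peer, UUID peerSchemaVersion)
>     {
>         if (!peerSchemaVersion.equals(localSchemaVersion))
>         {
>             // The problematic part: no dedup, no back-pressure. Each gossip
>             // round that still sees a mismatched version submits yet another task.
>             System.out.println("Submitting migration task for " + peer);
>             migrationStage.submit(() -> pullAndMergeSchema(peer));
>         }
>     }
>
>     private void pullAndMergeSchema(String peer)
>     {
>         // In the real system this issues a migration request and merges the
>         // response; the merge is the slow part (it triggers compactions).
>     }
> }
> ----------------------------------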
> The following is the system.log output that occurred on one of our 2-node cluster test beds:
> ----------------------------------
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2:
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> Node 2 keeps submitting the migration task 100+ times to the other node:
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> ... (over 100+ times)
> ----------------------------------
> On the other hand, Node 1 keeps updating its gossip information, then receiving and submitting migration tasks afterwards:
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> ... (over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> ... (over 50+ times)
> On a side note, we have 200+ column families defined in the Cassandra database, which may be related to this amount of RPC traffic.
> P.S. The excess schema migration requests eventually have InternalResponseStage perform a schema merge operation for each response. Since each merge requires a compaction and is much slower than the rate at which responses arrive, the backlog of incoming schema migration content objects consumes all of the heap space and ultimately ends in OOM!
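> To make the failure mode concrete: the full-schema responses queue up on an unbounded executor queue while each merge is slow, so it is the queued payloads, not the merge itself, that exhaust the heap. Below is a minimal, hypothetical guard illustrating one way to bound this - allowing at most one in-flight schema pull per endpoint. It is a sketch of the idea only, not the actual patch for this ticket, and all names are illustrative.
> ----------------------------------
> import java.util.Map;
> import java.util.UUID;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> // Hypothetical guard (not the actual CASSANDRA-11748 patch): at most one
> // in-flight schema pull per endpoint, so repeated gossip rounds cannot pile
> // up hundreds of identical tasks and their response payloads on the heap.
> public class DedupedSchemaPull
> {
>     private final ExecutorService migrationStage = Executors.newSingleThreadExecutor();
>
>     // Endpoint -> schema version currently being pulled from it.
>     private final Map<String, UUID> inFlight = new ConcurrentHashMap<>();
>
>     public void maybeSubmitPull(String peer, UUID peerVersion)
>     {
>         // putIfAbsent is atomic: only the first caller for a given peer wins.
>         if (inFlight.putIfAbsent(peer, peerVersion) != null)
>             return; // a pull from this peer is already in flight - skip
>
>         migrationStage.submit(() -> {
>             try
>             {
>                 pullAndMergeSchema(peer);
>             }
>             finally
>             {
>                 inFlight.remove(peer); // allow a future pull if versions still differ
>             }
>         });
>     }
>
>     private void pullAndMergeSchema(String peer) { /* request + merge response */ }
> }
> ----------------------------------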



