Date: Wed, 11 May 2016 03:28:12 +0000 (UTC)
From: "Michael Fong (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Created] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

Michael Fong created CASSANDRA-11748:
----------------------------------------

             Summary: Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process
                 Key: CASSANDRA-11748
                 URL:
https://issues.apache.org/jira/browse/CASSANDRA-11748
             Project: Cassandra
          Issue Type: Bug
         Environment: Rolling upgrade process from 1.2.19 to 2.0.17.
CentOS 6.6
            Reporter: Michael Fong


We have observed multiple times that a multi-node C* (v2.0.17) cluster ran into OOM at bootstrap during a rolling upgrade from 1.2.19 to 2.0.17.

Here is an outline of our rolling upgrade process:
1. Update the schema on a node, and wait until all nodes are in schema version agreement (checked via nodetool describecluster).
2. Restart a Cassandra node.
3. After the restart, there is a chance that the restarted node has a different schema version.
4. All nodes in the cluster start to rapidly exchange schema information, and any node could run into OOM.

The following is the system.log output from one of our 2-node cluster test beds:
----------------------------------
Before rebooting node 2:
Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f

After rebooting node 2:
Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b

Node 2 then keeps submitting the migration task to the other node, over 100 times:
INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node /192.168.88.33 has restarted, now UP
INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
...
DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
...
(over 100 times)
----------------------------------
On the other hand, Node 1 keeps updating its gossip information, and afterwards receives and submits migration tasks:
INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
...
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
... (over 100 times)
DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
..... (over 50 times)

As a side note, we have over 200 column families defined in the Cassandra database, which may be related to this amount of RPC traffic.
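
To put numbers on the "over 100 times" observations, the log excerpts above could be tallied with a small script. This is only a rough sketch and not part of Cassandra: the function name summarize_migration_log and the regular expressions are hypothetical, and they assume the 2.0.x DEBUG message wording quoted in this report and IPv4 peer addresses.

```python
import re
from collections import Counter

# Patterns mirror the message text quoted in this report (assumed 2.0.x wording).
UUID = r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"
IPV4 = r"\d+(?:\.\d+){3}"
SCHEMA_RE = re.compile(r"Gossiping my schema version (" + UUID + r")")
# The report shows both "Submitting" and "submitting", so accept either case.
SUBMIT_RE = re.compile(r"[Ss]ubmitting migration task for /(" + IPV4 + r")")
REQUEST_RE = re.compile(r"Received migration request from /(" + IPV4 + r")")


def summarize_migration_log(lines):
    """Return (gossiped schema versions, submitted-per-peer, received-per-peer)."""
    versions = []          # schema versions in the order they were gossiped
    submitted = Counter()  # migration tasks submitted, keyed by peer address
    received = Counter()   # migration requests received, keyed by peer address
    for line in lines:
        match = SCHEMA_RE.search(line)
        if match:
            versions.append(match.group(1))
            continue
        match = SUBMIT_RE.search(line)
        if match:
            submitted[match.group(1)] += 1
            continue
        match = REQUEST_RE.search(line)
        if match:
            received[match.group(1)] += 1
    return versions, submitted, received


if __name__ == "__main__":
    import sys

    # Usage (path is whatever system.log you want to inspect):
    #   python summarize_migration_log.py /path/to/system.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        versions, submitted, received = summarize_migration_log(handle)
    print("schema versions gossiped:", versions)
    print("migration tasks submitted:", dict(submitted))
    print("migration requests received:", dict(received))
```

Running it against the system.log of each node before and after the restart would show whether the gossiped schema version changed and how many migration tasks were exchanged with each peer.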