From: Alain RODRIGUEZ
Date: Wed, 11 May 2016 16:01:13 +0200
Subject: Re: Cassandra 2.0.x OOM during startup - schema version inconsistency after reboot
To: user@cassandra.apache.org
Cc: dev@cassandra.apache.org

Hi Michaels :-),

My guess is this ticket will be closed with a "Won't Fix" resolution. Cassandra 2.0 is no longer supported, and I have seen similar tickets rejected, like CASSANDRA-10510. Would you like to upgrade to the latest 2.1.x release and see whether you still hit the issue?

About your issue: do you stop your node using a command like the following one?

nodetool disablethrift && nodetool disablebinary && sleep 5 && nodetool disablegossip && sleep 10 && nodetool drain && sleep 10 && sudo service cassandra stop

or even flushing:

nodetool disablethrift && nodetool disablebinary && sleep 5 && nodetool disablegossip && sleep 10 && nodetool flush && nodetool drain && sleep 10 && sudo service cassandra stop

Are the commit logs empty when you start Cassandra?

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com
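For reference, the shutdown sequence above can be wrapped in a small script. This is a minimal sketch only, assuming the packaged init service name "cassandra" and the default commit log directory /var/lib/cassandra/commitlog; both may differ per installation.

    #!/bin/bash
    # Gracefully shut down a Cassandra node: stop client traffic, leave
    # gossip, flush memtables, and drain before stopping the process.
    set -e
    nodetool disablethrift      # stop Thrift clients
    nodetool disablebinary      # stop native-protocol (CQL) clients
    sleep 5
    nodetool disablegossip      # node now appears down to the rest of the ring
    sleep 10
    nodetool flush              # flush memtables to SSTables
    nodetool drain              # after a clean drain there is nothing left to replay
    sleep 10
    sudo service cassandra stop
    # Quick sanity check before the next start: the commit log directory
    # should hold nothing that needs replaying (default package path).
    ls -l /var/lib/cassandra/commitlog

A node stopped this way should not replay anything from the commit log on the next start, which is relevant to the commit-log-replay suspicion discussed further down the thread.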
2016-05-11 5:35 GMT+02:00 Michael Fong:

> Hi,
>
> Thanks for your recommendation.
> I also opened a ticket to keep track @
> https://issues.apache.org/jira/browse/CASSANDRA-11748
> Hope this brings it to someone's attention. Thanks.
>
> Sincerely,
>
> Michael Fong
>
> -----Original Message-----
> From: Michael Kjellman [mailto:mkjellman@internalcircle.com]
> Sent: Monday, May 09, 2016 11:57 AM
> To: dev@cassandra.apache.org
> Cc: user@cassandra.apache.org
> Subject: Re: Cassandra 2.0.x OOM during startup - schema version
> inconsistency after reboot
>
> I'd recommend you create a JIRA! That way you can get some traction on the
> issue. Obviously an OOM is never correct, even if your process is wrong in
> some way!
>
> Best,
> kjellman
>
> Sent from my iPhone
>
> > On May 8, 2016, at 8:48 PM, Michael Fong <michael.fong@ruckuswireless.com> wrote:
> >
> > Hi, all,
> >
> > We haven't heard any responses so far, and this issue has troubled us for
> > quite some time. Here is another update:
> >
> > We have noticed several times that the schema version may change after
> > migration and reboot. Here is the scenario:
> >
> > 1. Two-node cluster (node1 & node2).
> >
> > 2. There are some schema changes, i.e. creating a few new column
> > families. The cluster waits until both nodes have their schema versions
> > in sync (describe cluster) before moving on.
> >
> > 3. Right before node2 is rebooted, the schema versions are consistent;
> > however, after node2 reboots and starts servicing, the MigrationManager
> > gossips a different schema version.
> >
> > 4. Afterwards, both nodes keep exchanging schema messages indefinitely
> > until one of the nodes dies.
> >
> > We currently suspect the schema change is due to replaying old entries
> > in the commit log. We wish to dig further, but need expert help on this.
> >
> > I don't know if anyone has seen this before, or if there is anything
> > wrong with our migration flow.
> >
> > Thanks in advance.
> >
> > Best regards,
> >
> > Michael Fong
> >
> > From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
> > Sent: Thursday, April 21, 2016 6:41 PM
> > To: user@cassandra.apache.org; dev@cassandra.apache.org
> > Subject: RE: Cassandra 2.0.x OOM during bootstrap
> >
> > Hi, all,
> >
> > Here is some more information on what happened before the OOM on the
> > rebooted node in a 2-node test cluster:
> >
> > 1. It seems the schema version changed on the rebooted node after
> > reboot, i.e.
> >
> > Before reboot:
> > Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> > Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> >
> > After rebooting node 2:
> > Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> >
> > 2. After reboot, both nodes repeatedly send MigrationTasks to each
> > other - we suspect it is related to the schema version (Digest) mismatch
> > after node 2 rebooted. Node 2 keeps submitting the migration task over
> > 100+ times to the other node:
> >
> > INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node /192.168.88.33 has restarted, now UP
> > INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
> > INFO [GossipStage:1] 2016-04-19 11:18:18,263 StorageService.java (line 1544) Node /192.168.88.33 state jump to normal
> > INFO [GossipStage:1] 2016-04-19 11:18:18,264 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
> > DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> > DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
> > DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
> > INFO [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
> > DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> > DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
> > INFO [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
> > DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> > DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
> > INFO [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
> > DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,356 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> > .....
> >
> > On the other hand, Node 1 keeps updating its gossip information, followed
> > by receiving and submitting migration tasks afterwards:
> >
> > DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,332 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
> > INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> > DEBUG [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
> > INFO [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> > DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
> > INFO [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> > ......
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,595 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,843 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,878 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> > ......
> > DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> > DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> > DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> > .....
> >
> > Has anyone experienced this scenario? Thanks in advance!
> >
> > Sincerely,
> >
> > Michael Fong
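A quick way to see the disagreement these logs describe is to compare the schema version each node has applied locally with what it believes its peers are on. A minimal sketch, assuming cqlsh and nodetool are on the PATH and reusing the node addresses from the logs above:

    #!/bin/bash
    # Compare schema versions across the two nodes from the logs above.
    # Once migrations settle, every node should report the same version.
    for host in 192.168.88.33 192.168.88.34; do
        echo "== $host =="
        # Schema version this node has applied locally:
        echo "SELECT schema_version FROM system.local;" | cqlsh "$host"
        # Schema versions it believes its peers are on:
        echo "SELECT peer, schema_version FROM system.peers;" | cqlsh "$host"
    done
    # Cluster-wide summary (run on any node); more than one entry under
    # "Schema versions" means the cluster has not reached schema agreement.
    nodetool describecluster

If the rebooted node keeps reporting a version its peer never converges to, the two sides will keep submitting migration tasks to each other, which matches the log storm above.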
> > From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
> > Sent: Wednesday, April 20, 2016 10:43 AM
> > To: user@cassandra.apache.org; dev@cassandra.apache.org
> > Subject: Cassandra 2.0.x OOM during bootstrap
> >
> > Hi, all,
> >
> > We have recently encountered a Cassandra OOM issue when Cassandra is
> > brought up, sometimes (but not always), in our 4-node cluster test bed.
> >
> > After analyzing the heap dump, we found the Internal-Response thread pool
> > (JMXEnabledThreadPoolExecutor) filled with thousands of
> > 'org.apache.cassandra.net.MessageIn' objects, occupying > 2 gigabytes of
> > heap memory.
> >
> > According to documentation on the internet, the internal-response thread
> > pool seems to be related to schema checking. Has anyone encountered a
> > similar issue before?
> >
> > We are using Cassandra 2.0.17 and JDK 1.8. Thanks in advance!
> >
> > Sincerely,
> >
> > Michael Fong
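The backlog described in this last message can also be spotted without a heap dump. A minimal sketch, assuming a standard install where nodetool and the JDK tools are on the PATH; the pgrep pattern and dump path are illustrative only:

    #!/bin/bash
    # Check whether the InternalResponseStage pool (the one holding the
    # MessageIn objects in the heap dump) is accumulating pending tasks.
    nodetool tpstats | grep -E 'Pool Name|InternalResponseStage'

    # If the pending count keeps growing, capture a heap dump for offline
    # analysis (e.g. with Eclipse MAT). The PID lookup and output path are
    # assumptions; adjust for your environment.
    PID=$(pgrep -f CassandraDaemon)
    jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof "$PID"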