From: Alain RODRIGUEZ
Date: Wed, 11 May 2016 16:01:13 +0200
Subject: Re: Cassandra 2.0.x OOM during startup - schema version inconsistency after reboot
To: user@cassandra.apache.org
Cc: dev@cassandra.apache.org

Hi Michaels :-),

My guess is this ticket will be closed with a "Won't Fix" resolution. Cassandra 2.0 is no longer supported, and I have seen similar tickets rejected, like CASSANDRA-10510. Would you like to upgrade to the latest 2.1.x release and see whether you still hit the issue?

About your issue: do you stop your node using a command like the following one?

nodetool disablethrift && nodetool disablebinary && sleep 5 && nodetool disablegossip && sleep 10 && nodetool drain && sleep 10 && sudo service cassandra stop

or even flushing:

nodetool disablethrift && nodetool disablebinary && sleep 5 && nodetool disablegossip && sleep 10 && nodetool flush && nodetool drain && sleep 10 && sudo service cassandra stop

Are the commit logs empty when you start Cassandra?

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com
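For reference, the shutdown sequence above can be wrapped in a small script. This is a minimal sketch only, assuming the packaged init service name "cassandra" and the default commit log directory /var/lib/cassandra/commitlog; both may differ per installation.

    #!/bin/bash
    # Gracefully shut down a Cassandra node: stop client traffic, leave
    # gossip, flush memtables, and drain before stopping the process.
    set -e
    nodetool disablethrift      # stop Thrift clients
    nodetool disablebinary      # stop native-protocol (CQL) clients
    sleep 5
    nodetool disablegossip      # node now appears down to the rest of the ring
    sleep 10
    nodetool flush              # flush memtables to SSTables
    nodetool drain              # after a clean drain there is nothing left to replay
    sleep 10
    sudo service cassandra stop
    # Quick sanity check before the next start: the commit log directory
    # should hold nothing that needs replaying (default package path).
    ls -l /var/lib/cassandra/commitlog

A node stopped this way should not replay anything from the commit log on the next start, which is relevant to the commit-log-replay suspicion discussed further down the thread.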
2016-05-11 5:35 GMT+02:00 Michael Fong:

> Hi,
>
> Thanks for your recommendation.
> I also opened a ticket to keep track @
> https://issues.apache.org/jira/browse/CASSANDRA-11748
> Hope this brings it to someone's attention. Thanks.
>
> Sincerely,
>
> Michael Fong
>
> -----Original Message-----
> From: Michael Kjellman [mailto:mkjellman@internalcircle.com]
> Sent: Monday, May 09, 2016 11:57 AM
> To: dev@cassandra.apache.org
> Cc: user@cassandra.apache.org
> Subject: Re: Cassandra 2.0.x OOM during startup - schema version
> inconsistency after reboot
>
> I'd recommend you create a JIRA! That way you can get some traction on the
> issue. Obviously an OOM is never correct, even if your process is wrong in
> some way!
>
> Best,
> kjellman
>
> Sent from my iPhone
>
> > On May 8, 2016, at 8:48 PM, Michael Fong <michael.fong@ruckuswireless.com> wrote:
> >
> > Hi, all,
> >
> > We haven't heard any responses so far, and this issue has troubled us for
> > quite some time. Here is another update:
> >
> > We have noticed several times that the schema version may change after
> > migration and reboot. Here is the scenario:
> >
> > 1. Two-node cluster (node1 & node2).
> >
> > 2. There are some schema changes, i.e. creating a few new column
> > families. The cluster waits until both nodes have their schema versions
> > in sync (describe cluster) before moving on.
> >
> > 3. Right before node2 is rebooted, the schema versions are consistent;
> > however, after node2 reboots and starts servicing, the MigrationManager
> > gossips a different schema version.
> >
> > 4. Afterwards, both nodes keep exchanging schema messages indefinitely
> > until one of the nodes dies.
> >
> > We currently suspect the schema change is due to replaying old entries
> > in the commit log. We wish to dig further, but need expert help on this.
> >
> > I don't know if anyone has seen this before, or if there is anything
> > wrong with our migration flow.
> >
> > Thanks in advance.
> >
> > Best regards,
> >
> > Michael Fong
> >
> > From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
> > Sent: Thursday, April 21, 2016 6:41 PM
> > To: user@cassandra.apache.org; dev@cassandra.apache.org
> > Subject: RE: Cassandra 2.0.x OOM during bootstrap
> >
> > Hi, all,
> >
> > Here is some more information on what happened before the OOM on the
> > rebooted node in a 2-node test cluster:
> >
> > 1. It seems the schema version changed on the rebooted node after
> > reboot, i.e.
> >
> > Before reboot:
> > Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> > Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> >
> > After rebooting node 2:
> > Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> >
> > 2. After reboot, both nodes repeatedly send MigrationTasks to each
> > other - we suspect it is related to the schema version (Digest) mismatch
> > after node 2 rebooted. Node 2 keeps submitting the migration task over
> > 100+ times to the other node:
> >
> > INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node /192.168.88.33 has restarted, now UP
> > INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
> > INFO [GossipStage:1] 2016-04-19 11:18:18,263 StorageService.java (line 1544) Node /192.168.88.33 state jump to normal
> > INFO [GossipStage:1] 2016-04-19 11:18:18,264 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
> > DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> > DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
> > DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
> > INFO [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
> > DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> > DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
> > INFO [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
> > DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> > DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
> > INFO [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
> > DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,356 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> > .....
> >
> > On the other hand, Node 1 keeps updating its gossip information, followed
> > by receiving and submitting migration tasks afterwards:
> >
> > DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,332 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
> > INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> > DEBUG [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
> > INFO [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> > DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
> > INFO [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> > ......
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,595 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,843 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,878 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> > ......
> > DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> > DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> > DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> > .....
> >
> > Has anyone experienced this scenario? Thanks in advance!
> >
> > Sincerely,
> >
> > Michael Fong
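A quick way to see the disagreement these logs describe is to compare the schema version each node has applied locally with what it believes its peers are on. A minimal sketch, assuming cqlsh and nodetool are on the PATH and reusing the node addresses from the logs above:

    #!/bin/bash
    # Compare schema versions across the two nodes from the logs above.
    # Once migrations settle, every node should report the same version.
    for host in 192.168.88.33 192.168.88.34; do
        echo "== $host =="
        # Schema version this node has applied locally:
        echo "SELECT schema_version FROM system.local;" | cqlsh "$host"
        # Schema versions it believes its peers are on:
        echo "SELECT peer, schema_version FROM system.peers;" | cqlsh "$host"
    done
    # Cluster-wide summary (run on any node); more than one entry under
    # "Schema versions" means the cluster has not reached schema agreement.
    nodetool describecluster

If the rebooted node keeps reporting a version its peer never converges to, the two sides will keep submitting migration tasks to each other, which matches the log storm above.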
> > From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
> > Sent: Wednesday, April 20, 2016 10:43 AM
> > To: user@cassandra.apache.org; dev@cassandra.apache.org
> > Subject: Cassandra 2.0.x OOM during bootstrap
> >
> > Hi, all,
> >
> > We have recently encountered a Cassandra OOM issue when Cassandra is
> > brought up, sometimes (but not always), in our 4-node cluster test bed.
> >
> > After analyzing the heap dump, we found the Internal-Response thread pool
> > (JMXEnabledThreadPoolExecutor) filled with thousands of
> > 'org.apache.cassandra.net.MessageIn' objects, occupying > 2 gigabytes of
> > heap memory.
> >
> > According to documentation on the internet, the internal-response thread
> > pool seems to be related to schema checking. Has anyone encountered a
> > similar issue before?
> >
> > We are using Cassandra 2.0.17 and JDK 1.8. Thanks in advance!
> >
> > Sincerely,
> >
> > Michael Fong
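The backlog described in this last message can also be spotted without a heap dump. A minimal sketch, assuming a standard install where nodetool and the JDK tools are on the PATH; the pgrep pattern and dump path are illustrative only:

    #!/bin/bash
    # Check whether the InternalResponseStage pool (the one holding the
    # MessageIn objects in the heap dump) is accumulating pending tasks.
    nodetool tpstats | grep -E 'Pool Name|InternalResponseStage'

    # If the pending count keeps growing, capture a heap dump for offline
    # analysis (e.g. with Eclipse MAT). The PID lookup and output path are
    # assumptions; adjust for your environment.
    PID=$(pgrep -f CassandraDaemon)
    jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof "$PID"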