From: Maxim Potekhin
Date: Sun, 04 Dec 2011 13:10:49 -0500
To: user@cassandra.apache.org
Subject: Re: Repair failure under 0.8.6

I capped the heap and the error is still there. I keep seeing "node dead"
messages even when I know the nodes were OK. Where and how do I tweak
the timeouts?

9d-cfc9-4cbc-9f1d-1467341388b8, endpoint /130.199.185.193 died
 INFO [GossipStage:1] 2011-12-04 00:26:16,362 Gossiper.java (line 683) InetAddress /130.199.185.193 is now UP
ERROR [AntiEntropySessions:1] 2011-12-04 00:26:16,518 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[AntiEntropySessions:1,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Problem during repair session manual-repair-a6a655dc-63f0-4c1c-9c0b-0621f5692ba2, endpoint /130.199.185.194 died
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Problem during repair session manual-repair-a6a655dc-63f0-4c1c-9c0b-0621f5692ba2, endpoint /130.199.185.194 died
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:712)
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:749)
        at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:155)
        at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:527)
        at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57)
        at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157)

On 12/3/2011 8:34 PM, Maxim Potekhin wrote:
> Thank you Peter. Before I look into details as you suggest,
> may I ask what you mean by "automatically restarted"? The way
> the box and Cassandra are set up in my case is such that the
> death of either is final.
>
> Also, how do I look for full GC? I just realized that in the latest
> install, I might have omitted capping the heap size -- and the
> nodes have 48GB each. I guess this could be a problem, precipitating
> GC death, right?
>
> Thank you
>
> Maxim
>
>
> On 12/3/2011 7:46 PM, Peter Schuller wrote:
>>> quite understand how Cassandra declared a node dead (in the below).
>>> Was it a timeout? How do I fix that?
>> I was about to respond to say that repair doesn't fail just due to
>> failure detection, but this appears to have been broken by
>> CASSANDRA-2433 :(
>>
>> Unless there is a subtle bug, the exception you're seeing should be
>> indicative that it really was considered Down by the node. You might
>> grep the log for references to the node in question (UP or DOWN) to
>> confirm. The question is why, though. I would check if the node has
>> maybe automatically restarted, or gone into full GC, etc.
>>
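
For reference, Peter's suggestion of grepping the log for UP or DOWN references to a node could be sketched roughly as below. This is only an illustration, not anything shipped with Cassandra: the function name find_transitions and the Transition tuple are made up, the pattern is modelled on the single Gossiper line quoted in this thread, and the wording of the "down" message (DOWN or dead) is an assumption that should be checked against the actual system.log.

```python
import re
from collections import namedtuple

# Hypothetical record type for one state change seen in the log.
Transition = namedtuple("Transition", "timestamp endpoint state")

# Modelled on the line quoted in this thread:
#   INFO [GossipStage:1] 2011-12-04 00:26:16,362 Gossiper.java (line 683)
#   InetAddress /130.199.185.193 is now UP
# The alternatives DOWN and dead are assumptions about the "down" wording.
GOSSIP_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"Gossiper\.java \(line \d+\) "
    r"InetAddress /(?P<ip>[\d.]+) is now (?P<state>UP|DOWN|dead)"
)


def find_transitions(lines, endpoint):
    """Return the gossip state changes logged for one endpoint, in log order."""
    found = []
    for line in lines:
        match = GOSSIP_RE.search(line)
        if match and match.group("ip") == endpoint:
            found.append(
                Transition(match.group("ts"), match.group("ip"), match.group("state"))
            )
    return found


# Small demonstration using the line from the log excerpt above.
sample = [
    "INFO [GossipStage:1] 2011-12-04 00:26:16,362 Gossiper.java (line 683) "
    "InetAddress /130.199.185.193 is now UP",
]
for t in find_transitions(sample, "130.199.185.193"):
    print(t.timestamp, t.endpoint, t.state)
```

To run it against a real log, pass an open file object instead of the sample list, e.g. find_transitions(open(path), "130.199.185.194"), where path points at the node's system.log. Comparing the resulting timestamps with the time of the repair failure shows whether the node was marked down shortly before the session died.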