Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of potekhin@bnl.gov designates
 130.199.3.132 as permitted sender)
Message-ID: <4EDBB7A9.9060604@bnl.gov>
Date: Sun, 04 Dec 2011 13:10:49 -0500
From: Maxim Potekhin <potekhin@bnl.gov>
User-Agent: Mozilla/5.0 (Windows NT 6.0;
 rv:8.0) Gecko/20111105 Thunderbird/8.0
MIME-Version: 1.0
To: user@cassandra.apache.org
Subject: Re: Repair failure under 0.8.6
References: 
 <CACCYQcyG2rmJJzxQ_5YYYp8itxvX1pjey-dtXn1MS35dqjL2Lw@mail.gmail.com>
 <4EDAAF7E.40502@bnl.gov>
 <CAO5xsd3c8jk3BzKj2tsJXziKjO0PtJp53AiHj==+ZvC796JH1w@mail.gmail.com>
 <4EDACE10.6010804@bnl.gov>
In-Reply-To: <4EDACE10.6010804@bnl.gov>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

I capped the heap and the error is still there. I keep seeing "node dead"
messages even though I know the nodes were OK. Where and how do I tweak the
timeouts?


9d-cfc9-4cbc-9f1d-1467341388b8, endpoint /130.199.185.193 died
  INFO [GossipStage:1] 2011-12-04 00:26:16,362 Gossiper.java (line 683) InetAddress /130.199.185.193 is now UP
ERROR [AntiEntropySessions:1] 2011-12-04 00:26:16,518 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[AntiEntropySessions:1,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Problem during repair session manual-repair-a6a655dc-63f0-4c1c-9c0b-0621f5692ba2, endpoint /130.199.185.194 died
         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
         at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Problem during repair session manual-repair-a6a655dc-63f0-4c1c-9c0b-0621f5692ba2, endpoint /130.199.185.194 died
         at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:712)
         at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:749)
         at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:155)
         at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:527)
         at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57)
         at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157)


On 12/3/2011 8:34 PM, Maxim Potekhin wrote:
> Thank you Peter. Before I look into details as you suggest,
> may I ask what you mean by "automatically restarted"? The way
> the box and Cassandra are set up in my case is such that the
> death of either is final.
>
> Also, how do I look for full GC? I just realized that in the latest
> install, I might have omitted capping the heap size -- and the
> nodes have 48GB each. I guess this could be a problem, precipitating
> GC death, right?
>
> Thank you
>
> Maxim
>
>
> On 12/3/2011 7:46 PM, Peter Schuller wrote:
>>> quite understand how Cassandra declared a node dead (in the below).
>>> Was it a timeout? How do I fix that?
>> I was about to respond to say that repair doesn't fail just due to
>> failure detection, but this appears to have been broken by
>> CASSANDRA-2433 :(
>>
>> Unless there is a subtle bug the exception you're seeing should be
>> indicative that it really was considered Down by the node. You might
>> grep the log for references to the node in question (UP or DOWN) to
>> confirm. The question is why though. I would check if the node has
>> maybe automatically restarted, or gone into a full GC, etc.
>>
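
The suggestion quoted above, to grep the log for UP or DOWN references to
the node in question, could be sketched roughly as follows. This is only an
illustration, not part of Cassandra: the function name gossip_transitions is
made up, the "is now UP" wording is taken from the log excerpt in this
message, and the "dead"/"DOWN" wording for the opposite transition is an
assumption that may need adjusting to whatever the log actually prints.

```python
import re

# Matches gossip state lines shaped like the one in the log excerpt:
#   INFO [GossipStage:1] 2011-12-04 00:26:16,362 Gossiper.java (line 683) InetAddress /130.199.185.193 is now UP
# ASSUMPTION: the "dead" / "DOWN" wording is a guess at how the opposite
# transition is logged; only "is now UP" appears in the excerpt.
STATE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*?"
    r"InetAddress /(?P<ip>[\d.]+) is now (?P<state>UP|DOWN|dead)",
    re.IGNORECASE,
)


def gossip_transitions(lines, endpoint):
    """Return (timestamp, STATE) pairs for one endpoint, in log order.

    `lines` is any iterable of log lines; `endpoint` is the IP address
    without the leading slash. Both names are hypothetical helpers.
    """
    events = []
    for line in lines:
        match = STATE_RE.search(line)
        if match and match.group("ip") == endpoint:
            events.append((match.group("ts"), match.group("state").upper()))
    return events
```

Passing an open log file (for example the node's system.log, wherever it
lives on the box) together with "130.199.185.194" would list when that
endpoint was marked up or down, which can then be compared with the
timestamp of the repair failure.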