cassandra-user mailing list archives

From Maxim Potekhin <potek...@bnl.gov>
Subject Re: Repair failure under 0.8.6
Date Sun, 04 Dec 2011 18:10:49 GMT
I capped the heap and the error is still there. I keep seeing "node dead"
messages even when I know the nodes were OK. Where and how do I tweak
the timeouts?


9d-cfc9-4cbc-9f1d-1467341388b8, endpoint /130.199.185.193 died
 INFO [GossipStage:1] 2011-12-04 00:26:16,362 Gossiper.java (line 683) InetAddress /130.199.185.193 is now UP
ERROR [AntiEntropySessions:1] 2011-12-04 00:26:16,518 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[AntiEntropySessions:1,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Problem during repair session manual-repair-a6a655dc-63f0-4c1c-9c0b-0621f5692ba2, endpoint /130.199.185.194 died
         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
         at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Problem during repair session manual-repair-a6a655dc-63f0-4c1c-9c0b-0621f5692ba2, endpoint /130.199.185.194 died
         at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:712)
         at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:749)
         at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:155)
         at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:527)
         at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57)
         at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157)


On 12/3/2011 8:34 PM, Maxim Potekhin wrote:
> Thank you Peter. Before I look into details as you suggest,
> may I ask what you mean by "automatically restarted"? The way
> the box and Cassandra are set up in my case is such that the
> death of either is final.
>
> Also, how do I look for full GC? I just realized that in the latest
> install, I might have omitted capping the heap size -- and the
> nodes have 48GB each. I guess this could be a problem, precipitating
> GC death, right?
>
> Thank you
>
> Maxim
>
>
> On 12/3/2011 7:46 PM, Peter Schuller wrote:
>>> quite understand how Cassandra declared a node dead (in the below).
>>> Was it a timeout? How do I fix that?
>> I was about to respond to say that repair doesn't fail just due to
>> failure detection, but this appears to have been broken by
>> CASSANDRA-2433 :(
>>
>> Unless there is a subtle bug the exception you're seeing should be
>> indicative that it really was considered Down by the node. You might
>> grep the log for references to the node in question (UP or DOWN) to
>> confirm. The question is why, though. I would check whether the node
>> has maybe automatically restarted, or gone into full GC, etc.
>>
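
The log check suggested above can be sketched in Python. This is only an illustration, not part of the original exchange: the function name `gossip_transitions` is hypothetical, the "is now UP" wording is taken from the Gossiper line quoted in this thread, and the "is now dead" wording for the opposite transition is an assumption that may need adjusting for your version.

```python
import re
from typing import Iterable, List, Tuple

# Matches Gossiper state lines such as the one quoted in this thread:
#   INFO [GossipStage:1] 2011-12-04 00:26:16,362 Gossiper.java (line 683) InetAddress /130.199.185.193 is now UP
# ASSUMPTION: the DOWN transition is logged as "is now dead"; adjust if your log differs.
STATE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*?"
    r"InetAddress /(?P<ip>[\d.]+) is now (?P<state>UP|dead)",
    re.IGNORECASE,
)


def gossip_transitions(lines: Iterable[str], ip: str) -> List[Tuple[str, str]]:
    """Return (timestamp, state) pairs for one endpoint, in log order."""
    events = []
    for line in lines:
        match = STATE_RE.search(line)
        if match and match.group("ip") == ip:
            events.append((match.group("ts"), match.group("state").upper()))
    return events
```

To use it, open the node's system log (its location depends on your install) and pass the file object together with the endpoint address, for example `gossip_transitions(open(path), "130.199.185.194")`. Comparing the resulting timestamps with the time of the repair failure should show whether the node was really marked down at that moment.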

