cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sylvain Lebresne (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-3548) NPE in AntiEntropyService$RepairSession.completed()
Date Thu, 01 Dec 2011 12:04:40 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Sylvain Lebresne updated CASSANDRA-3548:
----------------------------------------

    Attachment: 3548-v2.patch

We also have the same race in rendezvous with the jobs. Attaching v2 that fix both.
                
> NPE in AntiEntropyService$RepairSession.completed()
> ---------------------------------------------------
>
>                 Key: CASSANDRA-3548
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3548
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.1
>         Environment: Free BSD 8.2, JVM vendor/version: OpenJDK 64-Bit Server VM/1.6.0
>            Reporter: Aaron Morton
>            Assignee: Aaron Morton
>            Priority: Minor
>         Attachments: 0001-3548.patch, 3548-v2.patch
>
>
> This may be related to CASSANDRA-3519 (cluster it was observed on is still 1.0.1), however
i think there is still a race condition.
> Observed on a 2 DC cluster, during a repair that spanned the DC's.  
> {noformat}
> INFO [AntiEntropyStage:1] 2011-11-28 06:22:56,225 StreamingRepairTask.java (line 136)
[streaming task #69187510-1989-11e1-0000-5ff37d368cb6] Forwarding streaming repair of 8602

> ranges to /10.6.130.70 (to be streamed with /10.37.114.10)
> ...
>  INFO [AntiEntropyStage:66] 2011-11-29 11:20:57,109 StreamingRepairTask.java (line 253)
[streaming task #69187510-1989-11e1-0000-5ff37d368cb6] task succeeded
> ERROR [AntiEntropyStage:66] 2011-11-29 11:20:57,109 AbstractCassandraDaemon.java (line
133) Fatal exception in thread Thread[AntiEntropyStage:66,5,main]
> java.lang.NullPointerException
>         at org.apache.cassandra.service.AntiEntropyService$RepairSession.completed(AntiEntropyService.java:712)
>         at org.apache.cassandra.service.AntiEntropyService$RepairSession$Differencer$1.run(AntiEntropyService.java:912)
>         at org.apache.cassandra.streaming.StreamingRepairTask$2.run(StreamingRepairTask.java:186)
>         at org.apache.cassandra.streaming.StreamingRepairTask$StreamingRepairResponse.doVerb(StreamingRepairTask.java:255)
>         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:679)
> {noformat}
> One of the nodes involved in the repair session failed, e.g. (Not sure if this is from
the same repair session as the streaming task above, but it illustrates the issue)
> {noformat}
> ERROR [AntiEntropySessions:1] 2011-11-28 19:39:52,507 AntiEntropyService.java (line 688)
[repair #2bf19860-197f-11e1-0000-5ff37d368cb6] session completed with the following error
> java.io.IOException: Endpoint /10.29.60.10 died
>         at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:725)
>         at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:762)
>         at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:192)
>         at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:559)
>         at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:62)
>         at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:167)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:679)
> ERROR [GossipTasks:1] 2011-11-28 19:39:52,507 StreamOutSession.java (line 232) StreamOutSession
/10.29.60.10 failed because {} died or was restarted/removed
> ERROR [GossipTasks:1] 2011-11-28 19:39:52,571 Gossiper.java (line 172) Gossip error
> java.util.ConcurrentModificationException
>         at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:782)
>         at java.util.ArrayList$Itr.next(ArrayList.java:754)
>         at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:190)
>         at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:559)
>         at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:62)
>         at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:167)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:679)
> {noformat}
> When a node is marked as failed AntiEntropyService.RepairSession.forceShutdown() clears
the activejobs map. But the jobs to other nodes will continue, and will eventually call completed().

> RepairSession.terminated should stop completed() from checking the map, but there is
a race between the map been cleared and if there is an error in finally block it wont be set.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message