cassandra-commits mailing list archives

From "sankalp kohli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-8815) Race in sstable ref counting during streaming failures
Date Tue, 17 Feb 2015 18:36:12 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324637#comment-14324637 ]

sankalp kohli commented on CASSANDRA-8815:
------------------------------------------

+1
Looks good. 

>  Race in sstable ref counting during streaming failures 
> --------------------------------------------------------
>
>                 Key: CASSANDRA-8815
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8815
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: sankalp kohli
>            Assignee: Benedict
>             Fix For: 2.0.13
>
>         Attachments: 8815.txt
>
>
> We have seen a machine in production on which all read threads are blocked (spinning) while
> trying to acquire the reference lock on sstables. Some stream sessions are doing the same.
> On looking at the heap dump, we could see that a live sstable which is part of the View
> has a ref count of 0. This sstable is not compacting, nor is it part of any failed compaction.
> On looking through the code, we could see that if the ref count goes to zero while the sstable
> is still part of the View, all reader threads will spin forever.
> On looking further through the streaming code, we could see that if StreamTransferTask.complete
> is called after closeSession has been called due to an error in OutgoingMessageHandler, it will
> double decrement the ref count of an sstable.
> This race can happen, and we can see from the exceptions in the logs that closeSession was
> triggered by OutgoingMessageHandler.
> The fix for this is very simple, I think. In StreamTransferTask.abort, we can remove a
> file from "files" before decrementing the ref count. This will avoid the race.
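
For illustration only, here is a minimal Java sketch of the double decrement described in the issue and of the proposed "remove before decrementing" fix. It is not the Cassandra code and not the attached patch: the class names (RefCountSketch, SSTable, TransferTask), the abortUnsafe/abortSafe split, and the assumption that the original abort released references without removing entries from "files" are all hypothetical stand-ins chosen to show the idea.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the race; none of these types are real Cassandra classes.
final class RefCountSketch {

    /** Stand-in for an sstable: starts with one reference, held by the live View. */
    static final class SSTable {
        final AtomicInteger refs = new AtomicInteger(1);
        void acquire() { refs.incrementAndGet(); }
        void release() { refs.decrementAndGet(); }
    }

    /** Stand-in for a transfer task that holds one reference per file being streamed. */
    static final class TransferTask {
        private final Map<Integer, SSTable> files = new ConcurrentHashMap<>();

        void addFile(int sequenceNumber, SSTable sstable) {
            sstable.acquire();
            files.put(sequenceNumber, sstable);
        }

        /** Releases the reference only if this call is the one that removed the entry. */
        void complete(int sequenceNumber) {
            SSTable sstable = files.remove(sequenceNumber);
            if (sstable != null) sstable.release();
        }

        /** Assumed buggy shape: releases every file but leaves it in the map. */
        void abortUnsafe() {
            for (SSTable sstable : files.values()) sstable.release();
        }

        /** Proposed shape: remove first, so a late complete() finds nothing to release. */
        void abortSafe() {
            for (Integer sequenceNumber : files.keySet()) {
                SSTable sstable = files.remove(sequenceNumber);
                if (sstable != null) sstable.release();
            }
        }
    }

    /** Runs abort followed by a late complete() and returns the final ref count. */
    static int refsAfterAbortThenComplete(boolean safe) {
        SSTable sstable = new SSTable();
        TransferTask task = new TransferTask();
        task.addFile(1, sstable);          // refs: 1 (View) + 1 (stream) = 2
        if (safe) task.abortSafe(); else task.abortUnsafe();
        task.complete(1);                  // arrives after the session was closed
        return sstable.refs.get();
    }

    public static void main(String[] args) {
        System.out.println("unsafe abort: " + refsAfterAbortThenComplete(false)); // 0
        System.out.println("safe abort:   " + refsAfterAbortThenComplete(true));  // 1
    }
}
```

In the unsafe variant the View's own reference is consumed, which corresponds to the "live sstable with ref count = 0" seen in the heap dump. In the safe variant, whichever of abort and complete removes the entry from the map is the only one that releases the reference.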



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
