cassandra-commits mailing list archives

From "Aaron Morton (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (CASSANDRA-2290) Repair hangs if one of the neighbor is dead
Date Wed, 09 Mar 2011 18:33:06 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004707#comment-13004707
] 

Aaron Morton edited comment on CASSANDRA-2290 at 3/9/11 6:32 PM:
-----------------------------------------------------------------

Not sure if this helps. I found a place where AES was hanging while testing failure during
streaming transfer for CASSANDRA-2088 (against 0.7). I deliberately broke FileStreamTask so that
it sends only one range and then closes the sending channel.

IncomingStreamReader.readFile() got stuck in an infinite loop because it does not check
the return value of FileChannel.transferFrom(), which was returning 0 bytes read. Also, FileStreamTask
does not check the number of bytes sent by transferTo().
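
To illustrate the kind of check that is missing, here is a minimal sketch of a receive loop that looks at the value returned by FileChannel.transferFrom(). It is not the Cassandra code: the class and method names (CheckedTransfer, receiveExactly) are made up for the example, and it assumes a blocking source channel, so a return of 0 is treated as "the sender is gone" rather than "try again later".

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;

// Hypothetical helper for illustration only; not part of Cassandra.
final class CheckedTransfer {

    /**
     * Copies exactly `length` bytes from `source` into `target`, starting at
     * file position `offset`. Throws instead of spinning when no progress is made.
     */
    static long receiveExactly(ReadableByteChannel source, FileChannel target,
                               long offset, long length) throws IOException {
        long received = 0;
        while (received < length) {
            long n = target.transferFrom(source, offset + received, length - received);
            if (n <= 0) {
                // Assumption: the source is a blocking channel, so zero bytes
                // means the peer stopped sending. Fail rather than loop forever.
                throw new EOFException("stream ended after " + received
                        + " of " + length + " bytes");
            }
            received += n;
        }
        return received;
    }
}
```

The same idea would apply on the sending side: accumulate the value returned by transferTo() and stop or fail when it no longer advances. With a non-blocking channel a zero return can be legitimate, so a bounded retry or a timeout would probably be a better fit there.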

While it was stuck in the loop, the socket it was reading from was in CLOSE_WAIT (127.0.0.1
was in the loop, 127.0.0.2 was sending):
java      25371 aaron   73u  IPv4 0xffffff8010742ff8      0t0  TCP 127.0.0.1:7000->127.0.0.2:52759 (CLOSE_WAIT)

When I was debugging, the socketChannel was still reporting that it was open.
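
That observation is consistent with how java.nio channels behave: isOpen() only reflects whether close() was called on the local side, so it stays true after the remote end has gone away. A small self-contained sketch over loopback (the class name PeerCloseDemo and the method observe() are invented for this example) shows that the remote close is only visible through the value returned by read():

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical demo class; names are illustrative only.
final class PeerCloseDemo {

    /** Accepts one loopback connection, lets the peer close it, then reports what the local side sees. */
    static String observe() throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Port 0 asks the OS for any free port.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            SocketChannel sender = SocketChannel.open(server.getLocalAddress());
            try (SocketChannel receiver = server.accept()) {
                sender.close(); // the sending side goes away
                // Blocking read: returns -1 once the peer's close has been received.
                int n = receiver.read(ByteBuffer.allocate(16));
                // isOpen() is still true because the receiver was never closed locally.
                return "isOpen=" + receiver.isOpen() + " read=" + n;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(observe());
    }
}
```

Under these assumptions the receiver should report that the channel is still open while read() returns -1, so a loop that relies on isOpen() alone would not notice the dead sender.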

> Repair hangs if one of the neighbor is dead
> -------------------------------------------
>
>                 Key: CASSANDRA-2290
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2290
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>            Priority: Minor
>             Fix For: 0.7.4
>
>         Attachments: 0001-Don-t-start-repair-if-a-neighbor-is-dead.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Repair doesn't cope well with dead/dying neighbors. There are 2 problems:
>   # Repair doesn't check whether a node is dead before sending a TreeRequest; this is easily fixable.
>   # If a neighbor dies mid-repair, the repair will also hang forever.
> The second point is not easy to deal with. The best approach is probably CASSANDRA-1740,
> however. That is, if we add a way to query the state of a repair, have that query correctly
> check all neighbors, and also add a way to cancel a repair, this would probably be enough.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
