cassandra-commits mailing list archives

From "Omid Aladini (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9458) Race condition causing StreamSession to get stuck in WAIT_COMPLETE
Date Wed, 27 May 2015 12:04:17 GMT


Omid Aladini commented on CASSANDRA-9458:

Thanks for checking the log and the patch. You're right: all the relevant calls to maybeCompleted
are synchronised on the object.

> Do you have secondary indexes? Right now, streaming is considered completed after secondary
> indexes are built in that finalise phase (CASSANDRA-9308).

There are secondary indexes, and I see a bunch of "submitting index build of" messages in the
full log, so I guess it's possible that the index build is just taking longer than the timeout.
I'll disable the timeout (and enable TCP keep-alive via CASSANDRA-9455) to see if it gets resolved.
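
For readers unfamiliar with how a socket read timeout surfaces, here is a minimal sketch in plain
Java. It assumes, without confirmation from this thread, that streaming_socket_timeout_in_ms ends
up as an ordinary read timeout on the streaming socket. The class and method names are
hypothetical; only Socket.setSoTimeout and Socket.setKeepAlive are real JDK calls.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical demo class; not part of Cassandra. */
public class SocketTimeoutSketch {

    /** Returns true if a blocking read on an idle loopback connection hits the read timeout. */
    public static boolean readTimesOut(int timeoutMs) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) {
            client.setKeepAlive(true);      // ask the OS to probe an idle connection
            client.setSoTimeout(timeoutMs); // a value of 0 would mean "block forever"
            InputStream in = client.getInputStream();
            try {
                in.read();                  // the peer never writes, so this blocks
                return false;
            } catch (SocketTimeoutException e) {
                return true;                // the read gave up after timeoutMs
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

With a non-zero timeout, a slow but healthy peer looks the same as a dead one. With the timeout
disabled, keep-alive is what is left to detect a connection that has really gone away.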

> Race condition causing StreamSession to get stuck in WAIT_COMPLETE
> ------------------------------------------------------------------
>                 Key: CASSANDRA-9458
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Omid Aladini
>            Assignee: Omid Aladini
>            Priority: Critical
>             Fix For: 2.1.x, 2.0.x
>         Attachments: 9458-v1.txt
> I think there is a race condition in StreamSession where one side of the stream could get
> stuck in WAIT_COMPLETE although both sides have sent COMPLETE messages. Consider a scenario in
> which node B is being bootstrapped and only receives files during the session:
> 1- During a stream session A sends some files to B and B sends no files to A.
> 2- Once B completes the last task (receiving), StreamSession::maybeComplete is invoked.
> 3- While B is sending the COMPLETE message via StreamSession::maybeComplete, it also receives
> the COMPLETE message from A and therefore StreamSession::complete() is invoked.
> 4- Therefore both maybeComplete() and complete() have branched into the
> state != State.WAIT_COMPLETE case and both set the state to WAIT_COMPLETE.
> 5- Now B is waiting to receive COMPLETE although it has already received it, and nothing
> triggers checking the state again until it times out after streaming_socket_timeout_in_ms.
> In the log below:
> although the node has received COMPLETE, "SocketTimeoutException" is thrown after
> streaming_socket_timeout_in_ms (30 minutes here).
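
The handshake in steps 1-5 can be sketched as a toy state machine. This is a simplified model
under stated assumptions, not Cassandra's actual StreamSession: the class name, the STREAMING
state and the completeSent flag are hypothetical, and only WAIT_COMPLETE, COMPLETE, complete()
and maybeCompleted() are names taken from this thread. It illustrates the point made in the
comment above: when both methods are synchronised on the same object, whichever runs second
sees WAIT_COMPLETE and finishes the session, so the interleaving in step 4 cannot occur.

```java
/** Toy model of the completion handshake; hypothetical, not Cassandra code. */
public class StreamSessionSketch {

    public enum State { STREAMING, WAIT_COMPLETE, COMPLETE }

    private State state = State.STREAMING;
    private boolean completeSent = false;

    /** Called when the last local task finishes (step 2). */
    public synchronized void maybeCompleted() {
        if (state == State.WAIT_COMPLETE) {
            state = State.COMPLETE;       // peer's COMPLETE already arrived
        } else {
            completeSent = true;          // stands in for sending COMPLETE to the peer
            state = State.WAIT_COMPLETE;
        }
    }

    /** Called when the peer's COMPLETE message arrives (step 3). */
    public synchronized void complete() {
        if (state == State.WAIT_COMPLETE) {
            state = State.COMPLETE;       // our own COMPLETE was already sent
        } else {
            state = State.WAIT_COMPLETE;  // wait for local tasks to finish
        }
    }

    public synchronized State state() {
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        // Race the two calls many times; the shared lock serialises them.
        for (int i = 0; i < 1000; i++) {
            StreamSessionSketch session = new StreamSessionSketch();
            Thread local = new Thread(session::maybeCompleted);
            Thread remote = new Thread(session::complete);
            local.start();
            remote.start();
            local.join();
            remote.join();
            if (session.state() != State.COMPLETE) {
                throw new IllegalStateException("stuck in " + session.state());
            }
        }
        System.out.println("all sessions reached COMPLETE");
    }
}
```

If the two methods checked and set the state without a common lock, both could observe a state
other than WAIT_COMPLETE and both could set WAIT_COMPLETE, which is the stuck case the
description hypothesises.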
