Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Tue, 2 Jul 2013 09:51:23 +0000 (UTC)
From: "Sylvain Lebresne (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12654745.1372172271710.197696.1372758683149@arcas>
In-Reply-To: <JIRA.12654745.1372172271710@arcas>
References: <JIRA.12654745.1372172271710@arcas>
Subject: [jira] [Updated] (CASSANDRA-5699) Streaming (2.0) can deadlock
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


     [ https://issues.apache.org/jira/browse/CASSANDRA-5699?page=3Dcom.atla=
ssian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-5699:
----------------------------------------

    Attachment: 5699.txt

bq. It's somewhat confusing that we use StreamInit both as leader and as fo=
llower, to mean different things

I guess, though if you see StreamInit as a way for both node to open/initia=
lize their outgoing connection to the other node (which is what this is doi=
ng), it's not really that different. And it's somewhat consistent with the =
rest of the protocol, where each side will sent a Prepare and then finally =
a Complete.  Anyway, updated the patch with the sentByInitiator renaming.

bq. What is going on with the switch from Set<UUID> ongoingSessions to Map<=
InetAddress, StreamSession> ongoingSessions in SRF?

ongoingSession only role initialy was to track how many sessions for a Stre=
amResultFuture were live to know when we're done (SRF.maybeComplete()). In =
the original patch of CASSANDRA-5286, it wasn't even there, and there was j=
ust an AtomicInteger decremented each time a session completed. I was sligh=
tly afraid that a race could have us counting the same session done twice, =
so I changed it to a Set<UUID> and added a simple UUID per session that was=
 only used for that purpose (it was never send to the other side or anythin=
g). Anyway, that UUID addition was stupid in the first place since for a SR=
F, there is only one StreamSession per remote host, so it should have been =
a Set<InetAddress> to start with and we have no use for a StreamSession UUI=
D.

But on top of that, this patch add the need to be able to find back a Strea=
mSession in a SRF given the remote endpoint (in IncomingStreamConnection, w=
hen the initiator gets the other side StreamInit), so we really need a map =
now.

bq. Suggest renaming, e.g. onSessionEstablished

Agreed, though I've renamed to onInitializationComplete to be consistent wi=
th the wording of the javadoc.

bq. Somewhat confused by logic in complete() =E2=80=93 "I received a Comple=
te message. If I'm already waiting for a complete message, close the sessio=
n. Otherwise, wait for [another] Complete message" ?

Each side will send a CompleteMessage to say "I'm done on my side", and the=
 session will be considered really complete (and closed) when we got the Co=
mplete for both side. So WAIT_COMPLETE means "I've seen only one complete m=
essage so far". Hence the logic of complete() is "I just received a complet=
e from the other side, if I had already completed my side, close the sessio=
n, otherwise move to the "I'm waiting for the other complete" state".

bq. It looks like there may be synchronization issues with "state;"

I think we're good, because the only case where we can race to set a state =
is during the completion phase, for WAIT_COMPLETE and COMPLETE, and those a=
re protected. For other states, there are set sequentially on each node by =
construction.

bq. Why is init broken out from construction?

SRF takes his StreamSession in his ctor so we can't have StreamSessions tak=
e their SRF without some other refactor.  Now don't get me wrong, it bugs m=
e too. And I do think there is a few things we can do to make that whole ne=
w streaming API simpler/cleaner (including probably renaming StreamRepairFu=
ture). And I'm volonteering to give to a shot to that. But in another follo=
w ticket because:
# I really think that kind of code cleanup shouldn't block beta1 (but this =
ticket, the fact that streaming deadlock, do is a blocker)
# I've already rewrote quite a bit of Yuki's code without him having the ch=
ange to chime in. Would feel fair to wait to him to be back before refactor=
ing parts that are not really crucial for beta1.

bq. Why do we include a String description in SIM?

This the description of what the operation is doing ("Repair", "Bootstrap",=
 "Restore replica count", ...)

bq. When does follower immediately have something to stream to leader

A StreamSession in the initiator handle both incoming and outgoing streams.=
 Sometimes we only have outgoing ones (Unbootstrap/Restore replica count/Bu=
lkloading), sometimes only
incoming ones (Bootstrap) and sometimes both (Repair indeed but also the SS=
.RangeRelocator for moves and vnodes tokens relocation). Does that answer y=
our question?

               =20
> Streaming (2.0) can deadlock
> ----------------------------
>
>                 Key: CASSANDRA-5699
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5699
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>             Fix For: 2.0 beta 1
>
>         Attachments: 5699.txt
>
>
> The new streaming implementation (CASSANDRA-5286) creates 2 threads per h=
ost for streaming, one for the incoming stream and one for the outgoing one=
. However, both currently share the same socket, but since we use synchrono=
us I/O, a read can block a write, which can result in a deadlock if 2 nodes=
 are both blocking on a read a the same time, thus blocking their respectiv=
e writes (this is actually fairly easy to reproduce with a simple repair).
> So instead attaching a patch that uses one socket per thread.
> The patch also correct the stream throughput throttling calculation that =
was 8000 times lower than what it should be.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrato=
rs
For more information on JIRA, see: http://www.atlassian.com/software/jira