impala-reviews mailing list archives

From "Henry Robinson (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-3977: TransmitData() should not block
Date Wed, 14 Dec 2016 19:44:32 GMT
Henry Robinson has posted comments on this change.

Change subject: IMPALA-3977: TransmitData() should not block

Patch Set 1:

File be/src/runtime/

Do you think we still need the stream cache? In the case where a receiver has completed before
it receives an initial RPC, we should be able to rely on cancellation being detected on the
sender side. The cost is that we might retry that initial RPC a couple of times before we
receive cancellation from the coordinator.

Can you think this through and work out whether there's a state where the stream cache is
really worth the extra complexity?

PS1, Line 99: if (acquire_lock) lock_.lock();
Can we just require now that the caller takes the lock?
File be/src/runtime/

PS1, Line 209: TTransmitDataResult res;
Let's move this inside the loop so it's freshly initialized on each iteration. That avoids any bugs where one RPC returns EAGAIN and sets a field in the result that, for some reason, a subsequent successful RPC does not overwrite.

PS1, Line 217: DoTransmitDataRpc
Since a row batch can potentially be many megabytes in size, I wonder if it makes sense to have
an initial 'probe' RPC that has no payload, but just confirms that the receiver is ready.
Then we can limit the retry logic to that probe only, and if a subsequent RPC returns EAGAIN,
we know that the receiver is closed and don't need to retry.

Line 224:     if (res.status.status_code != TErrorCode::DATASTREAM_RECEIVER_EAGAIN) break;
Is cancellation detected during this loop? That's part of the point of moving control back
to the sender on EAGAIN, so that cancellation can be detected. Otherwise this blocks from
the sender's perspective in the same way that it did before.

PS1, Line 225: VLOG_RPC << "Datastream receiver not yet ready. Retrying";
Not very informative without the fragment ID / destination address etc.

PS1, Line 226: FLAGS_datastream_sender_timeout_ms
Rather than having the sleep period be a function of the total timeout, it makes more sense
to use a fixed retry interval and exit this loop once the timeout has been exceeded.

For example, if someone sets the timeout to ten minutes (because they're being ultra conservative
on a heavy cluster), this will retry every two minutes.

Line 228: 
I think it's a bit confusing for this method to return EAGAIN upon failure. Can you replace
the status with a timeout error if the loop times out?


Gerrit-MessageType: comment
Gerrit-Change-Id: I47f9d6c1de95a72ab73471dcd2a3118ef13e7db0
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <>
Gerrit-Reviewer: Henry Robinson <>
Gerrit-Reviewer: Lars Volker <>
Gerrit-Reviewer: Sailesh Mukil <>
Gerrit-HasComments: Yes
