kudu-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From jdcry...@apache.org
Subject [2/2] kudu git commit: KUDU-1963. Avoid NPE and SSLException when a TabletClient is closed during negotiation
Date Tue, 04 Apr 2017 19:45:23 GMT
KUDU-1963. Avoid NPE and SSLException when a TabletClient is closed during negotiation

This fixes two related bugs which were triggered when a client called
close() on a TabletClient (e.g. because the end user of the API called

1) javax.net.ssl.SSLException: renegotiation attempted by peer; closing the connection

This was caused by the following interleaving in Netty:

T1                                          T2
------------------------                    -----------------
      - starts a TLS "shutdown" message
      - underlying SSLEngine is now in a
        "closing" state                     call Channel.write()
                                               call SSLEngine.write()
                                                 - gets NEEDS_UNWRAP
                                                 - interprets this as a
                                                   renegotiation and
                                                   throws an Exception

      - call SSLEngine.unwrap() to get the
        TLS message data to send

I wasn't able to write a unit test that triggered this reliably, since
TabletClient is too tightly coupled to ConnectionCache and
AsyncKuduClient. However, I was able to reproduce this reliably on an
Impala cluster by starting 100 threads sending short queries round-robin
to 9 impalads. Previously I saw the exception within a few seconds and
now I've run successfully for many minutes with no errors.

2) NullPointerException following above SSLException

When the above SSLException occurred, it called the
TabletClient#exceptionCaught() method, which called 'cleanup()' to fail
any outstanding RPCs. However, because the exception was being caught
because the client had already been explicitly closed, the
'rpcsInflight' and 'pendingRpcs' lists had already been set to null.
This caused an NPE as we tried to process those lists for a second time.

Another case which cased a similar NPE was that the NegotiationResult
could come after the close() had been called, in which case
'pendingRpcs' would already be null.

This patch fixes both issues as follows:
- if we've already explicitly closed the client, we ignore SSLExceptions
  rather than logging them. Actually avoiding this race seemed to be
  extremely complicated, so better to just ignore it.
- properly check whether the pendingRpcs/rpcsInFlight lists were already
  null before processing them.

Change-Id: Ide91722446d01acb58e3e181e0d10932e023a043
Reviewed-on: http://gerrit.cloudera.org:8080/6541
Tested-by: Kudu Jenkins
Reviewed-by: Jean-Daniel Cryans <jdcryans@apache.org>

Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/ac6e24fc
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/ac6e24fc
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/ac6e24fc

Branch: refs/heads/master
Commit: ac6e24fc2299430b5f70146324bef2d2e8a4b8bc
Parents: 2bc2c32
Author: Todd Lipcon <todd@apache.org>
Authored: Mon Apr 3 18:08:00 2017 -0700
Committer: Jean-Daniel Cryans <jdcryans@apache.org>
Committed: Tue Apr 4 19:45:04 2017 +0000

 .../org/apache/kudu/client/TabletClient.java    | 23 ++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/java/kudu-client/src/main/java/org/apache/kudu/client/TabletClient.java b/java/kudu-client/src/main/java/org/apache/kudu/client/TabletClient.java
index e26e883..24e80f8 100644
--- a/java/kudu-client/src/main/java/org/apache/kudu/client/TabletClient.java
+++ b/java/kudu-client/src/main/java/org/apache/kudu/client/TabletClient.java
@@ -35,6 +35,7 @@ import java.util.concurrent.RejectedExecutionException;
 import java.util.concurrent.locks.ReentrantLock;
 import javax.annotation.concurrent.GuardedBy;
+import javax.net.ssl.SSLException;
 import com.google.common.annotations.VisibleForTesting;
 import com.google.common.base.Preconditions;
@@ -358,7 +359,12 @@ public class TabletClient extends SimpleChannelUpstreamHandler {
       // Send the queued RPCs after dropping the lock, so we don't end up calling
       // their callbacks/errbacks with the lock held.
-      sendQueuedRpcs(queuedRpcs);
+      //
+      // queuedRpcs may be null in the case that we disconnected the client just
+      // at the same time as the Negotiator was finishing up.
+      if (queuedRpcs != null) {
+        sendQueuedRpcs(queuedRpcs);
+      }
     if (!(m instanceof CallResponse)) {
@@ -633,12 +639,14 @@ public class TabletClient extends SimpleChannelUpstreamHandler {
       // for negotiation to complete.
       if (pendingRpcs != null) {
+        pendingRpcs = null;
-      pendingRpcs = null;
       // Similarly, we need to fail any that were already sent and in-flight.
-      rpcsToFail.addAll(rpcsInflight.values());
-      rpcsInflight = null;
+      if (rpcsInflight != null) {
+        rpcsToFail.addAll(rpcsInflight.values());
+        rpcsInflight = null;
+      }
     } finally {
@@ -702,6 +710,13 @@ public class TabletClient extends SimpleChannelUpstreamHandler {
       if (!closedByClient) {
         LOG.info(getPeerUuidLoggingString() + "Lost connection to peer");
+    } else if (e instanceof SSLException && closedByClient) {
+      // There's a race in Netty where, when we call Channel.close(), it tries
+      // to send a TLS 'shutdown' message and enters a shutdown state. If another
+      // thread races to send actual data on the channel, then Netty will get a
+      // bit confused that we are trying to send data and misinterpret it as a
+      // renegotiation attempt, and throw an SSLException. So, we just ignore any
+      // SSLException if we've already attempted to close.
     } else {
       LOG.error(getPeerUuidLoggingString() + "Unexpected exception from downstream on " +
c, e);

View raw message