hadoop-hdfs-issues mailing list archives

From "Daryn Sharp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-11804) KMS client needs retry logic
Date Thu, 08 Jun 2017 17:41:18 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043108#comment-16043108 ]

Daryn Sharp commented on HDFS-11804:
------------------------------------

General issues:
* The configured "attempts" value does not count attempts; it counts retries, so the actual
number of attempts is the configured value + 1.  I only noticed this because of the test; see below.
* Suggest incrementing {{numFailovers}} in the for loop statement, instead of a conditional
in the middle of the loop, to be a little more clear about what's happening.
* The catch block for ACE (AccessControlException) should rethrow instead of "break"-ing.  Breaking
out of the loop causes it to log a misleading line saying it tried all the providers when it actually didn't.
* If {{shouldRetry}} throws an IOE (IOException), it is erroneously re-wrapped in another IOE.
* The sleep condition is {{numFailovers >= providers.length}}. I think it makes more sense
as {{(numFailovers % providers.length) == 0}}, so that it sleeps only between sweeps of the KMS cluster.
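
To make the loop-related suggestions above concrete, here is a minimal, self-contained sketch. It is not the code from the patch: {{FailoverSketch}}, {{ProviderOp}}, {{AccessDenied}}, {{doOp}}, and the {{trace}} list are hypothetical names used only for illustration, and the {{numFailovers > 0}} guard is an assumption added so the loop does not sleep before the very first call. It shows the counter advancing in the for statement, a sleep only between full sweeps of the providers, and an access-control failure being rethrown rather than falling through to the "tried everything" path.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: none of these names come from the patch under review.
class FailoverSketch {

    /** Hypothetical stand-in for one call against the provider at a given index. */
    interface ProviderOp<T> {
        T call(int providerIndex) throws IOException;
    }

    /** Hypothetical stand-in for an access-control failure, which must not be retried. */
    static class AccessDenied extends IOException {
        AccessDenied(String msg) { super(msg); }
    }

    static <T> T doOp(int numProviders, int maxAttempts, long sleepMillis,
                      ProviderOp<T> op, List<String> trace) throws IOException {
        IOException last = null;
        // The failover counter advances in the for statement, not mid-loop.
        for (int numFailovers = 0; numFailovers < maxAttempts; numFailovers++) {
            // Sleep only between full sweeps of the provider list.
            if (numFailovers > 0 && numFailovers % numProviders == 0) {
                trace.add("sleep");
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IOException("interrupted while backing off", ie);
                }
            }
            int idx = numFailovers % numProviders;
            trace.add("call:" + idx);
            try {
                return op.call(idx);
            } catch (AccessDenied ace) {
                // Rethrow as-is: no further providers are tried, and the
                // "all attempts failed" path below is never reached.
                throw ace;
            } catch (IOException ioe) {
                last = ioe; // remember the failure and fail over to the next provider
            }
        }
        throw new IOException("all " + maxAttempts + " attempts failed", last);
    }

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        try {
            doOp(3, 7, 0L, i -> { throw new IOException("provider " + i + " down"); }, trace);
        } catch (IOException expected) {
            System.out.println(trace); // sequence of calls and sleeps
        }
    }
}
```

With three providers and seven attempts, the trace records a sleep after each complete sweep (after the 3rd and 6th calls) rather than before every call once the first sweep has finished.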

Test issues:
* {{testClientRetriesWithAccessControlException}} doesn't appear to test what it claims to
test.  The 1st provider throws IOE, the 2nd throws ACE.  The test doesn't verify that the ACE stopped
the retries; it only verifies that the IOE did.
* {{LoadBalancingKMSClientProvider.KMS_FAILOVER_MAX_ATTEMPTS_KEY}} is really acting like max
retries. {{testClientRetriesSpecifiedNumberOfTimes}} shows that configuring 10 attempts does not result in 10 attempts.
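
The attempts-versus-retries mismatch comes down to the loop bound. The sketch below is purely illustrative (the class and method names are hypothetical and do not appear in the patch); it counts how many calls are made when a configured value of 10 is treated as a retry count versus a total attempt count.

```java
// Illustrative only: counts calls under the two interpretations of the setting.
class AttemptCounting {

    /** Loop bound that treats the configured value as a retry count. */
    static int callsIfValueMeansRetries(int configured) {
        int calls = 0;
        for (int i = 0; i <= configured; i++) {
            calls++; // one initial call plus `configured` retries
        }
        return calls;
    }

    /** Loop bound that treats the configured value as a total attempt count. */
    static int callsIfValueMeansAttempts(int configured) {
        int calls = 0;
        for (int i = 0; i < configured; i++) {
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(callsIfValueMeansRetries(10));  // prints 11
        System.out.println(callsIfValueMeansAttempts(10)); // prints 10
    }
}
```

A test that asserts on the exact number of provider calls would catch this difference either way, as long as the key name and the loop bound agree on which interpretation is intended.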

> KMS client needs retry logic
> ----------------------------
>
>                 Key: HDFS-11804
>                 URL: https://issues.apache.org/jira/browse/HDFS-11804
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>         Attachments: HDFS-11804-trunk-1.patch, HDFS-11804-trunk-2.patch, HDFS-11804-trunk-3.patch, HDFS-11804-trunk.patch
>
>
> The KMS client appears to have no retry logic – at all.  It's completely decoupled from the IPC retry logic.  This has a major impact whenever the KMS is unreachable for any reason, including but not limited to network connection issues, timeouts, and a +restart during an upgrade+.
> This has some major ramifications:
> # Jobs may fail to submit, although oozie resubmit logic should mask it
> # Non-oozie launchers may experience higher failure rates if they do not already have retry logic.
> # Tasks reading EZ files will fail, although this will probably be masked by framework reattempts
> # EZ file creation fails after creating a 0-length file – the client receives the EDEK in the create response, then fails when decrypting the EDEK
> # Bulk hadoop fs copies, and maybe distcp, will prematurely fail



