hadoop-hdfs-issues mailing list archives

From "Daryn Sharp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-11804) KMS client needs retry logic
Date Mon, 05 Jun 2017 14:00:10 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16036988#comment-16036988 ]

Daryn Sharp commented on HDFS-11804:

Minor comments:
# I'm not sure about a multiplier.  Let's say I want to retry for a certain amount of time.  I do the math and decide on 10 retries.  I have 2 kms servers so the multiplier is 5.  Then the bank is expanded to 8, causing 40 retries.  Or even the reverse: shrink the bank and now the client retries less than expected.  Just specifying a max retries that defaults to the number of load balanced hosts (today's behavior) is probably better.
# The max sleep time seems rather high since it can block server threads for a long time.  I'd suggest perhaps 1-2s.
# Should probably only sleep after trying all clients instead of between each client.  If there's a refused connection because a kms server is down, there's little use in waiting to try the next.
# Separate try blocks for shouldRetry and sleep to better control exception rethrows.
# Change the sleep catch to rethrow {{InterruptedIOException}}.
# Trivial but there's a lot of double punctuation, ex. periods, exclamations.
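
To make the suggestions above concrete, here is a minimal sketch of the loop shape I have in mind. It is not the patch and does not use the real KMS client classes: {{FailoverRetrySketch}}, {{ProviderOp}}, {{doOp}} and the constructor parameters are all hypothetical names, and treating every {{IOException}} as retriable is a simplification standing in for the {{shouldRetry}} decision. It shows a max attempt count that defaults to the number of hosts, a capped sleep taken only after a full pass over the hosts, and the sleep living in its own try block that rethrows {{InterruptedIOException}}.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only; none of these names come from the patch.
class FailoverRetrySketch {

    /** Stand-in for one call against one KMS host. */
    interface ProviderOp<T> {
        T call(String host) throws IOException;
    }

    private final List<String> hosts;
    private final int maxAttempts;
    private final long baseSleepMillis;
    private final long maxSleepMillis;

    FailoverRetrySketch(List<String> hosts, int maxAttempts,
                        long baseSleepMillis, long maxSleepMillis) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("need at least one host");
        }
        this.hosts = hosts;
        // Assumption: a non-positive value means "use the default", which is
        // one attempt per load balanced host (no multiplier involved).
        this.maxAttempts = maxAttempts > 0 ? maxAttempts : hosts.size();
        this.baseSleepMillis = baseSleepMillis;
        this.maxSleepMillis = maxSleepMillis;
    }

    <T> T doOp(ProviderOp<T> op) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String host = hosts.get((attempt - 1) % hosts.size());
            // First try block: the call itself. Simplification: every
            // IOException is treated as retriable.
            try {
                return op.call(host);
            } catch (IOException e) {
                last = e;
            }
            // Fail over to the next host immediately; only sleep once every
            // host has been tried and there are attempts left.
            boolean passDone = attempt % hosts.size() == 0;
            if (passDone && attempt < maxAttempts) {
                int pass = attempt / hosts.size();
                long sleep = Math.min(maxSleepMillis,
                    baseSleepMillis << Math.min(pass - 1, 20));
                // Second try block: the sleep, with its own rethrow.
                try {
                    Thread.sleep(sleep);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    InterruptedIOException iioe =
                        new InterruptedIOException("interrupted while waiting to retry");
                    iioe.initCause(ie);
                    throw iioe;
                }
            }
        }
        throw last; // never null: maxAttempts is at least 1
    }

    public static void main(String[] args) throws IOException {
        List<String> tried = new ArrayList<>();
        FailoverRetrySketch retry =
            new FailoverRetrySketch(Arrays.asList("kms1", "kms2"), 4, 100, 2000);
        // The first host refuses connections; the second one answers.
        String result = retry.doOp(host -> {
            tried.add(host);
            if (host.equals("kms1")) {
                throw new ConnectException("connection refused");
            }
            return "decrypted by " + host;
        });
        System.out.println(result + ", tried " + tried);
    }
}
```

With an explicit attempt count, growing the bank from 2 to 8 hosts changes which hosts get tried but not how many attempts are made, which is the property the multiplier loses.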

> KMS client needs retry logic
> ----------------------------
>                 Key: HDFS-11804
>                 URL: https://issues.apache.org/jira/browse/HDFS-11804
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>         Attachments: HDFS-11804-trunk-1.patch, HDFS-11804-trunk.patch
> The kms client appears to have no retry logic at all.  It's completely decoupled from the ipc retry logic.  This has major impacts if the KMS is unreachable for any reason, including but not limited to network connection issues, timeouts, and the +restart during an upgrade+.
> This has some major ramifications:
> # Jobs may fail to submit, although oozie resubmit logic should mask it
> # Non-oozie launchers may experience higher failure rates if they do not already have retry logic.
> # Tasks reading EZ files will fail, though this will probably be masked by framework reattempts
> # EZ file creation fails after creating a 0-length file: the client receives the EDEK in the create response, then fails when decrypting the EDEK
> # Bulk hadoop fs copies, and maybe distcp, will prematurely fail

