hadoop-hdfs-issues mailing list archives

From "Daryn Sharp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4389) Non-HA DFSClients do not attempt reconnects
Date Fri, 11 Jan 2013 22:18:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551581#comment-13551581 ]

Daryn Sharp commented on HDFS-4389:
-----------------------------------

The problem was discovered while debugging intermittent failures in {{TestPersistBlocks}}.
The {{TestRestartDfsWithFlush}} test does the following:
# open a stream
# write 5 blocks
# flush
# wait for at least 1 block to be finalized, record size
# bounce the NN
# ensure file is at least as big as before bounce
# write 5 more blocks <- race condition blows up here
# close stream
# ensure all data is there

The problem occurs when {{DFSOutputStream.DataStreamer}} needs to call {{DFSClient#addBlock}}
while the NN is down.  It receives a {{ConnectException}} from the IPC layer.  The exception
is not handled, so the streamer stores it and shuts down the stream.  The next write to the
stream after the NN restart then throws the stored {{ConnectException}}.
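
The failure mode described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: {{SketchStream}}, {{BlockAllocator}}, and {{failsAfterRestart}} are hypothetical names invented for illustration, and the real {{DataStreamer}} runs in a background thread rather than being stepped by hand.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified simulation only; none of these types are Hadoop classes.
class StickyFailureDemo {
    /** Hypothetical stand-in for the NN RPC that allocates a block. */
    interface BlockAllocator {
        void addBlock() throws IOException;
    }

    /** Hypothetical stand-in for an output stream with a background streamer. */
    static class SketchStream {
        private final BlockAllocator allocator;
        private IOException storedFailure; // first unhandled failure is kept here
        private int pendingBlocks;
        private int allocatedBlocks;

        SketchStream(BlockAllocator allocator) {
            this.allocator = allocator;
        }

        /** Caller-facing write: rethrows a stored failure, otherwise queues a block. */
        void write() throws IOException {
            if (storedFailure != null) {
                throw storedFailure;
            }
            pendingBlocks++;
        }

        /** One streamer iteration: allocate a block, or store the failure and stop. */
        void streamerStep() {
            if (storedFailure != null || pendingBlocks == 0) {
                return;
            }
            try {
                allocator.addBlock();
                pendingBlocks--;
                allocatedBlocks++;
            } catch (IOException e) {
                storedFailure = e; // no retry: the stream is now permanently failed
            }
        }

        int allocatedBlocks() {
            return allocatedBlocks;
        }
    }

    /** Returns true if a write after the simulated NN restart throws the stored ConnectException. */
    static boolean failsAfterRestart() {
        AtomicBoolean nnUp = new AtomicBoolean(true);
        SketchStream stream = new SketchStream(() -> {
            if (!nnUp.get()) {
                throw new ConnectException("Connection refused");
            }
        });
        try {
            stream.write();
            stream.streamerStep();   // NN is up: block allocated
            nnUp.set(false);         // NN goes down
            stream.write();
            stream.streamerStep();   // addBlock fails, failure is stored
            nnUp.set(true);          // NN comes back
            stream.write();          // still throws the stored exception
            return false;
        } catch (IOException e) {
            return e instanceof ConnectException;
        }
    }

    public static void main(String[] args) {
        System.out.println("write after restart fails: " + failsAfterRestart());
    }
}
```

In this sketch the NN is reachable again by the time of the final write, yet the write still fails, because the earlier failure was stored rather than retried.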

The end result is that streams cannot survive an NN restart or a network interruption that
lasts longer than the time it takes to write a block.  The issue probably affects all client
methods.
                
> Non-HA DFSClients do not attempt reconnects
> -------------------------------------------
>
>                 Key: HDFS-4389
>                 URL: https://issues.apache.org/jira/browse/HDFS-4389
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha, hdfs-client
>    Affects Versions: 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Priority: Critical
>
> The HA retry policy implementation appears to have broken non-HA {{DFSClient}} connect
> retries.  The ipc {{Client.Connection#handleConnectionFailure}} used to perform 45 connection
> attempts, but now it consults a retry policy.  For non-HA proxies, the policy does not handle
> {{ConnectException}}.
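
For illustration, the fixed-count behavior that the description says was lost can be sketched as a bounded retry loop that retries only on {{ConnectException}}. This is a minimal sketch, not the Hadoop implementation: {{callWithConnectRetries}}, {{attemptsUsed}}, and the sleep interval are hypothetical, and only the figure of 45 attempts comes from the issue text.

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;

// Illustrative sketch only; this is not the Hadoop RetryPolicy API.
class ConnectRetrySketch {
    /**
     * Invokes the call, retrying on ConnectException until maxAttempts is reached.
     * Any other exception propagates immediately.
     */
    static <T> T callWithConnectRetries(Callable<T> call, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ConnectException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (ConnectException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis); // assumed fixed pause between attempts
                }
            }
        }
        throw last; // every attempt failed to connect
    }

    /** Demo helper: counts the attempts made when the first N connects are refused. */
    static int attemptsUsed(int failuresBeforeSuccess, int maxAttempts) throws Exception {
        int[] attempts = {0};
        callWithConnectRetries(() -> {
            attempts[0]++;
            if (attempts[0] <= failuresBeforeSuccess) {
                throw new ConnectException("Connection refused");
            }
            return "connected";
        }, maxAttempts, 0L);
        return attempts[0];
    }

    public static void main(String[] args) throws Exception {
        // 45 is the attempt count quoted in the issue description.
        System.out.println("attempts used: " + attemptsUsed(3, 45));
    }
}
```

With a loop like this, a server that is unreachable for the first three attempts is reached on the fourth, whereas a policy that does not handle {{ConnectException}} fails on the first.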

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
