hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4257) The ReplaceDatanodeOnFailure policies could have a forgiving option
Date Wed, 23 Jul 2014 20:56:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072325#comment-14072325
] 

Colin Patrick McCabe commented on HDFS-4257:
--------------------------------------------

Nicholas, thank you for looking at this.  I can tell there have been a lot of JIRAs about
this problem (HDFS-3091, HDFS-3179, HDFS-5131, and HDFS-4600 are all somewhat related).

The basic problem that seems to happen a lot is:
1. Client loses network connectivity
2. The client tries to write.  But because it can't see anyone else on the network, it can
write to at most 1 replica.
3. The pipeline recovery code throws a hard error because it can't get 3 replicas (the
client-side settings that govern this are sketched after this list).
4. Client gets a write error and tries to close the file.  That just gives another error.
 The client goes into a bad state.  Sometimes the client continues trying to close the file
and continues getting an exception (although this behavior was changed recently).  Due to
HDFS-4504, the file never gets cleaned up on the NameNode if the client is long-lived.
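
For reference, the hard error in step #3 is driven by the existing replace-datanode-on-failure
client settings.  Below is a minimal sketch of the closest workaround available today, which is
to switch the policy to NEVER so the client keeps writing to whatever pipeline it has left.  The
NameNode URI and the path are just placeholders:

{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NeverReplacePolicyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Keep the feature enabled, but never ask for a replacement datanode when
    // the pipeline shrinks.  This avoids the hard error in step #3, at the
    // cost of silently accepting a 1-replica pipeline (the trade-off this
    // JIRA is about).
    conf.setBoolean(
        "dfs.client.block.write.replace-datanode-on-failure.enable", true);
    conf.set(
        "dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");

    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    fs.create(new Path("/tmp/pipeline-example")).close();
  }
}
{code}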

HBase and Flume are both long-lived clients that hit the HDFS-4504 problem.  HBase avoids
this particular problem by not using the HDFS pipeline recovery code, and instead doing its
own thing by checking the current number of replicas (roughly the check sketched below).  So
it never gets to step #3, because pipeline recovery is turned off.  For Flume, though, this
is a major problem.
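
For context, this is roughly the shape of the check an HBase-style client can do: ask the
output stream how many live replicas the current block has, and roll to a new file when that
drops below a minimum.  The getNumCurrentReplicas method is looked up reflectively because it
has not been a public, stable API across releases; the method name and the rolling threshold
should be treated as assumptions, not as HBase's exact code:

{code:java}
import java.io.OutputStream;
import java.lang.reflect.Method;
import org.apache.hadoop.fs.FSDataOutputStream;

public final class ReplicaCountCheck {

  /**
   * Best-effort query of the live replica count of the block currently being
   * written.  Returns -1 if the underlying stream does not expose the method.
   */
  public static int getCurrentReplicaCount(FSDataOutputStream out) {
    OutputStream wrapped = out.getWrappedStream();  // usually a DFSOutputStream
    try {
      Method m = wrapped.getClass().getMethod("getNumCurrentReplicas");
      m.setAccessible(true);
      return ((Integer) m.invoke(wrapped)).intValue();
    } catch (Exception e) {
      return -1;  // method not available in this Hadoop version
    }
  }

  /** Example policy: roll the file when the pipeline has shrunk too far. */
  public static boolean shouldRoll(FSDataOutputStream out, int minReplicas) {
    int replicas = getCurrentReplicaCount(out);
    return replicas != -1 && replicas < minReplicas;
  }
}
{code}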

The approach in this patch seems to be that instead of throwing a hard error in step #3, the
DFSClient should simply accept only having 1 replica.  This will certainly fix the problem
for Flume.  But imagine the following scenario:
1. Client loses network connectivity
2. The client tries to write.  But because it can't see anyone else on the network, it can
write to at most 1 replica.
3. The pipeline recovery code accepts only using 1 local replica
4. The client gets network connectivity back
5. A long time passes
6. The hard disks on the client node go down.

In this scenario, we lose the data after step #6.  The problem is that while the block is
still under construction, we won't try to re-replicate it to other nodes, even though the
network is back.

If we had a background thread that tried to repair the pipeline in step #5, we could avoid
this problem.  Another possibility is that instead of throwing an error or silently continuing
in step #3, we could simply wait for a configurable period after logging a message (a sketch
of that alternative follows).
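
To make the second alternative concrete, here is a sketch of a bounded wait that could sit in
the recovery path before we decide between failing hard and accepting a single replica.
Nothing here is existing HDFS code; the poll interval, the timeout, and the replacement lookup
are all placeholders:

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch only: wait a bounded time for a replacement datanode to
// become reachable instead of failing immediately or silently continuing.
public final class BoundedReplacementWait {

  /**
   * Polls findReplacement until it returns non-null or the timeout expires.
   * Returns the replacement, or null if none appeared in time.
   */
  public static <T> T waitForReplacement(Supplier<T> findReplacement, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      T replacement = findReplacement.get();
      if (replacement != null) {
        return replacement;         // pipeline can be repaired
      }
      TimeUnit.SECONDS.sleep(1);    // re-check once per second
    }
    return null;  // caller decides: fail hard, or accept the single replica
  }
}
{code}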

> The ReplaceDatanodeOnFailure policies could have a forgiving option
> -------------------------------------------------------------------
>
>                 Key: HDFS-4257
>                 URL: https://issues.apache.org/jira/browse/HDFS-4257
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: hdfs-client
>    Affects Versions: 2.0.2-alpha
>            Reporter: Harsh J
>            Assignee: Tsz Wo Nicholas Sze
>            Priority: Minor
>         Attachments: h4257_20140325.patch, h4257_20140325b.patch, h4257_20140326.patch
>
>
> A similar question has previously come up in HDFS-3091 and friends, but the essential problem
is: "Why can't I write to my cluster of 3 nodes when I have just 1 node available at a point
in time?"
> The policies cover the 4 options, with {{Default}} being default:
> {{Disable}} -> Disables the whole replacement concept by throwing out an error (at
the server) or acts as {{Never}} at the client.
> {{Never}} -> Never replaces a DN upon pipeline failures (not too desirable in many
cases).
> {{Default}} -> Replace based on a few conditions, but whose minimum never touches
1. We always fail if only one DN remains and none others can be added.
> {{Always}} -> Replace no matter what. Fail if can't replace.
> Would it not make sense to have an option similar to Always/Default where, despite _trying_,
if it isn't possible to have > 1 DN in the pipeline, we do not fail?  I think that is what
the former write behavior was, and it fit with the minimum allowed replication factor.
> Why is it grossly wrong to pass a write from a client for a block with just 1 remaining
replica in the pipeline (the minimum of 1 grows with the replication factor demanded from
the write), when replication is taken care of immediately afterwards? How often have we seen
missing blocks arise out of allowing this + facing a big rack(s) failure or so?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
