hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5924) Utilize OOB upgrade message processing for writes
Date Mon, 24 Feb 2014 02:11:19 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13909974#comment-13909974

Kihwal Lee commented on HDFS-5924:

h3. Batch upgrades and upgrade speed
bq. Shutting down a datanode after previous shutdown datanode is up might be able to minimize
the write failure but would increase the total upgrade time.

The upgrade batch size has come up multiple times in the past discussions. The conclusion
is that DN rolling upgrades shouldn't be done in batch in order to minimize data loss. When
upgrading a big cluster in the traditional way, a number of DNs don't come back online, mainly
due to latent hardware issues. This results in missing blocks and manual recovery is necessary.
Most cases are recoverable by fixing hardware and manually copying block files, but not all.
Sometimes a number of blocks are permanently lost.  Since the block placement policy is not
sophisticated enough to consider failure domains, any several simultaneous permanent node
failures can cause this.

During DN rolling upgrades, admins should watch out for data availability issues caused by
failed restart. Data loss from permanent failures can only be avoided if only 1-2 nodes are
upgraded at a time. If the normal failure rate of non-upgrading nodes, one soon realizes that
upgrading one at a time is the preferred way.

The upgrade timing requirement (req-3) in the design doc was specified based on the serial
DN upgrade scenario for this reason.

Regarding the upgrade speed, it largely depends on the number of blocks on each DN. The restart
of a DN with about half million blocks in 4-6 volumes used to take minutes. After a number
of optimizations and HDFS-5498, it can come back up in about 10 seconds.  

h3. Write pipeline recovery during DN upgrades
There are three things a client can do when a node in the pipeline is being upgraded/restarted:
# Exclude the node and continue: this is just like the regular pipeline recovery. If the pipeline
has enough number of nodes, clients can do this. Even if there is no OOB ack, majority of
writers will survive this way as long as they don't hit double/triple failures and upgrades
combined. But single replica writes will fail.
# Copy block from the upgrading node to a new one and continue: this is a variation of the
"add additional node" recovery path of the pipeline recovery. The client will ask for an additional
node from NN and then issue a copy command with the source as the upgrading node.  This does
not add any value unless there is only one nodes in the pipeline. Writes with only one replica
will fail if the copy cannot be carried out.
# Wait until the node comes back: this is an alternative to the above approach. It avoids
extra data movement and maintains the locality. I've talked to HBase guys and they said they
would prefer this.  Writes with only one replica will fail if the restart does not happen
in time.

If min_replica is set to 2 for a cluster, (1) may take care of service availability side of
the issue. I.e. all writes succeeds if nodes are upgraded one by one.  There are use cases,
however, require min_replica to be 1, so we need a way to make single replica writes survive

A single replica write may be a write that was initiated with one replica from the beginning
or may be a result of node failures during the write. The former is usually considered okay
to fail, since users usually understand the risk of the replication factor of 1 and make the
choice.  The more problematic case is the latter; the ones started with more than one replicas
but currently having only one due to node failures.  I believe (2) or (3) can solve most of
these cases, but neither is ideal.

In theory, (2) sounds useful, but there are some difficulties in realizing it. First, a DN
has to stop accepting new requests, yet allow copy requests to be allowed. This means DataXceiverServer's
server socket cannot be closed until copy requests are received. Since the knowledge about
a specific client (writer) is only known to each DataXceiver thread and the copy command will
spawn another thread on DN, coordinating this is not simple.  Secondly, DN should be able
to detect when it is safe to shutdown.  To get it correct, it has to be told by the clients.
A slow client can also slow down the progress. In the end, the extra coordination and dependencies
are not exactly cheap.

The mechanism in (3) works fine with no additional run-time traffic overhead and no dependencies
on clients. Timing-wise it is also more deterministic. The downside is that if the datanode
does not come back up in time, outstanding writes will timeout and fail, if the node was the
only one left in the pipeline.  In addition to making writes continue, this mechanism allows
locality to be preserved, so (2) cannot substitute (3) completely.

I suggest an additional change to be made in order to keep the mechanism simpler, but address
its deficiency.

h3. Suggested improvement
While using (3) for write pipeline recovery during upgrade-restarts, we can reduce the chance
of permanent failures for writers with the replica count reduced to one due to prior node
failures. If a block write started with no replication (ie. single replica) by the user, it
is assumed that the user understands the risk of higher possibility of read or write failures.
Thus we focus on the cases where the specified replication factor is greater than one.

The datanode replacement policy is defined in {{ReplaceDatanodeOnFailure}}. If the default
policy is modified to maintain two replicas at minimum for r > 1, the chance of write failures
for r > 1 during DN rolling upgrades will be greatly reduced.  Note that this does not
incur additional replication traffic.   

If you think this proposal makes sense, I will file a separate jira to update {{ReplaceDatanodeOnFailure}}.

> Utilize OOB upgrade message processing for writes
> -------------------------------------------------
>                 Key: HDFS-5924
>                 URL: https://issues.apache.org/jira/browse/HDFS-5924
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, ha, hdfs-client, namenode
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>         Attachments: HDFS-5924_RBW_RECOVERY.patch, HDFS-5924_RBW_RECOVERY.patch
> After HDFS-5585 and HDFS-5583, clients and datanodes can coordinate shutdown-restart
in order to minimize failures or locality loss.
> In this jira, HDFS client is made aware of the restart OOB ack and perform special write
pipeline recovery. Datanode is also modified to load marked RBW replicas as RBW instead of
RWR as long as the restart did not take long. 
> For clients, it considers doing this kind of recovery only when there is only one node
left in the pipeline or the restarting node is a local datanode.  For both clients and datanodes,
the timeout or expiration is configurable, meaning this feature can be turned off by setting
timeout variables to 0.

This message was sent by Atlassian JIRA

View raw message