hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5535) Umbrella jira for improved HDFS rolling upgrades
Date Tue, 25 Feb 2014 22:21:24 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13912153#comment-13912153 ]

Kihwal Lee commented on HDFS-5535:
----------------------------------

{quote}
[~andrew.wang] : Can you comment on how riding out DN restarts interacts with the HBase MTTR
work? I know they've done a lot of work to reduce timeouts throughout the stack, and riding
out restarts sounds like we need to keep the timeouts up. It might help to specify your target
restart time, for example with a DN with 500k blocks.
{quote}

There are multiple factors that affect DN upgrade latency. First, shutdown can take up
to about 2 seconds. The time is bounded so that slow clients or networks cannot delay the shutdown.
That is, the shutdown notification (via the OOB acks) is advisory and may not reach all clients.
HDFS does not provide an integrated tool for software/config upgrade & restart. The process
will differ depending on how Hadoop and its config are deployed. Whatever the method
is, it will add a bit of delay to the whole process.
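
As a rough illustration of how a writer could react to the advisory shutdown notification, here is a minimal sketch. The names (RestartAwareWriter, restartWaitMs, onRestartAck) are hypothetical; the real DFSClient pipeline-recovery code is considerably more involved.

{code}
// Sketch only: a client writer reacting to an advisory "DN restarting" ack.
// All names here are placeholders, not the actual DFSClient internals.
class RestartAwareWriter {
  // How long the client is willing to wait for the restarting DN to come back.
  private final long restartWaitMs;
  private long restartDeadlineMs = 0;

  RestartAwareWriter(long restartWaitMs) {
    this.restartWaitMs = restartWaitMs;
  }

  // Called when an ack from the pipeline indicates the DN is shutting down
  // for an upgrade. The notification is best-effort, so the writer must also
  // be prepared to hit an ordinary connection error instead.
  void onRestartAck() {
    restartDeadlineMs = System.currentTimeMillis() + restartWaitMs;
  }

  boolean shouldKeepWaiting() {
    return restartDeadlineMs > 0
        && System.currentTimeMillis() < restartDeadlineMs;
  }
}
{code}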

The actual DN startup time varies with the number of blocks and the number of disks
(which typically corresponds to the number of volumes). Before HDFS-5498, the DN would run "du" on each volume
serially and also list and stat block and meta files serially. Now "du" doesn't have to run
if the DN is restarted within 5 minutes of shutdown, and block/meta scans are done concurrently when
multiple volumes are present. So the startup time will be shorter if blocks are spread across
more disks. In a test carried out at the beginning of HDFS-5498, the restart time of a node
shrank from over a minute to about 12 seconds. We could make restart even shorter by saving
more state on shutdown.
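
To make the concurrency point concrete, here is a minimal sketch of scanning volumes in parallel; Volume, BlockInfo and the helper methods are placeholders, not the actual DataNode code.

{code}
// Sketch only: scan each volume's block/meta files on its own thread so that
// startup cost is roughly that of the largest volume, not the sum of all
// volumes. Volume and BlockInfo are placeholder types.
import java.util.*;
import java.util.concurrent.*;

class ParallelVolumeScan {
  List<BlockInfo> scanAll(List<Volume> volumes) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(volumes.size());
    List<Future<List<BlockInfo>>> results =
        new ArrayList<Future<List<BlockInfo>>>();
    for (final Volume v : volumes) {
      results.add(pool.submit(new Callable<List<BlockInfo>>() {
        public List<BlockInfo> call() throws Exception {
          return v.scanBlocksAndMeta();   // list + stat block and meta files
        }
      }));
    }
    List<BlockInfo> all = new ArrayList<BlockInfo>();
    for (Future<List<BlockInfo>> f : results) {
      all.addAll(f.get());                // merge per-volume results serially
    }
    pool.shutdown();
    return all;
  }
}
{code}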

The final step before the DN can serve clients is registration. If block tokens are
in use, the DN won't be able to validate them until registration with the NN is complete. This usually
happens immediately after scanning and adding blocks. We could save the shared secret before
shutdown, but that can obviously be a security risk.
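
A minimal sketch of the ordering constraint, assuming a hypothetical BlockTokenVerifier gate; the real DataNode and block token secret manager interaction is more involved.

{code}
// Sketch only: block token validation is gated on NN registration, because
// the shared secret keys arrive with the registration response.
// BlockTokenVerifier and its methods are placeholders.
class BlockTokenVerifier {
  private volatile boolean keysReceived = false;

  // Called once registration with the NN has completed and the current
  // block keys have been pushed to this DN.
  void onRegistered(byte[] currentKeys) {
    loadKeys(currentKeys);
    keysReceived = true;
  }

  boolean canValidate() {
    return keysReceived;  // until then, token-authenticated requests must wait
  }

  private void loadKeys(byte[] keys) { /* decode and cache the keys */ }
}
{code}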

I will be more than happy to further improve the DN restart latency if it becomes a major hurdle
for HBase.

{quote}
[~stack] :  + "3. Clients" It says "For shutdown, clients may choose to copy partially written
replica to another node..." The DFSClient would do this internally (or 'may' do this)?
{quote}

This is done internally. It is what happens today to clients writing with 3 or more replicas when
node failures leave only (# replicas / 2) nodes in the pipeline. If the restarting node is not
the local node, or there is more than one node in the pipeline, the client will simply exclude the restarting
node and, if necessary, add more nodes to the pipeline and copy the partially written data over
to the new node. If the restarting node is the local node or is the only node in the pipeline, the
client will wait for the node to be restarted. There is a configurable timeout for the wait.
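
Here is a minimal sketch of that decision; the class, flags and helper methods are illustrative, not the actual DFSOutputStream internals.

{code}
// Sketch only: the client's choice when a pipeline DN announces a restart.
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

class PipelineRestartHandling {
  void onDatanodeRestarting(DatanodeInfo restarting, boolean isLocal,
      boolean isOnlyNodeInPipeline, long restartWaitMs) {
    if (isLocal || isOnlyNodeInPipeline) {
      // Losing this node would stall or kill the write, so wait for it to
      // come back, bounded by the configurable restart timeout.
      waitForRestart(restarting, restartWaitMs);
    } else {
      // Otherwise treat it like a failed node: drop it, add a replacement if
      // needed, and copy the partially written replica to the new node.
      excludeFromPipeline(restarting);
      addReplacementAndTransferPartialBlock();
    }
  }

  private void waitForRestart(DatanodeInfo dn, long waitMs) { /* ... */ }
  private void excludeFromPipeline(DatanodeInfo dn) { /* ... */ }
  private void addReplacementAndTransferPartialBlock() { /* ... */ }
}
{code}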

If the restarting node was the only node in the pipeline, the write will fail after the restart
timeout. This is very unlikely for blocks with 3 or more replicas due to the DN replacement
policy. Blocks being written with a replication factor of 2 can, however, suffer if the
writer has already lost one node to a failure and loses the other because the restart fails.
HDFS-6016 has been filed to address this issue, but it is a non-blocker for
the rolling upgrade feature.

Regarding the use of shortened read/write socket timeouts, I can see use cases that favor failing
over to a slower node rather than waiting longer for the same node to recover (if lucky). Reads
will fail over to a different source during a DN restart. For writes, the restart timeout
is independent of the socket write timeout, so the client may block for longer than the
user wants. If that is undesirable, the configuration should be changed so that DFSOutputStream
times out on restart much more quickly. The default is currently 30 seconds.
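
For illustration, a client could shorten that wait roughly as below. The config key shown is my recollection of the new client-side setting (value in seconds); please verify the exact key name against the committed sub-jira patch.

{code}
// Sketch only: shortening the DN-restart wait on the client side.
// Classes are org.apache.hadoop.conf/hdfs/fs; the key name is an assumption.
Configuration conf = new HdfsConfiguration();
conf.setLong("dfs.client.datanode-restart.timeout", 5);  // default is 30
FileSystem fs = FileSystem.get(conf);
{code}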

> Umbrella jira for improved HDFS rolling upgrades
> ------------------------------------------------
>
>                 Key: HDFS-5535
>                 URL: https://issues.apache.org/jira/browse/HDFS-5535
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, ha, hdfs-client, namenode
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Nathan Roberts
>         Attachments: HDFSRollingUpgradesHighLevelDesign.pdf, h5535_20140219.patch, h5535_20140220-1554.patch,
h5535_20140220b.patch, h5535_20140221-2031.patch, h5535_20140224-1931.patch, h5535_20140225-1225.patch
>
>
> In order to roll a new HDFS release through a large cluster quickly and safely, a few
enhancements are needed in HDFS. An initial High level design document will be attached to
this jira, and sub-jiras will itemize the individual tasks.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
