hadoop-hdfs-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5535) Umbrella jira for improved HDFS rolling upgrades
Date Mon, 24 Feb 2014 23:36:26 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910982#comment-13910982 ]

Andrew Wang commented on HDFS-5535:

Hi all,

It looks like this feature is getting close, nice work! Can we get a revision of the design doc
at a more detailed level as we approach merge? The details surrounding e.g. the user API
and implementation seem to have been ironed out, so they should be included. There are also (I believe
deprecated) mentions of lite-decom. It'd also be nice if someone could unify the section title
formatting, since there are a number of different parts (checkpoint/rollback, NN failover,
DN restart), and they each use their own formatting schemes. Namely, it'd be very helpful
to consistently number the section titles (most word processing apps can do this for you).

I also had a few questions after reading the doc, sorry in advance if these were already answered
in the comments:

* Can you expand on NN/DN consistency with the rollback marker and heartbeat notifications?
I'm not familiar with append or lease recovery, so it'd be nice to get more explanation on
those in particular.
* Could you comment on your experiences regarding the interval between an upgrade and finalize?
My impression was that right now, cluster operators might wait a long time before finalizing
to be safe (e.g. a week or two). Since checkpointing would be paused with the rollback marker,
a lot of edits would accumulate, and NN startup time would suffer.
* Big +1 to not changing the layout version any further in the 2.x line after this. With PB'd
metadata and feature flags (whenever they arrive), this makes NN upgrade a lot more pleasant.
We should also call this out on the Hadoop compatibility wiki page when this JIRA is merged.
* Can you comment on how riding out DN restarts interacts with the HBase MTTR work? I know
they've done a lot of work to reduce timeouts throughout the stack, and riding out restarts
sounds like we need to keep the timeouts up. It might help to specify your target restart
time, for example with a DN with 500k blocks.
* Are longer restarts (e.g. OS or hardware upgrade) part of the scope? Obviously, 1-repl blocks
would become an issue, and a super long timeout is not a good solution. Maybe this is just
the normal decom process needing love, but it'd be nice to address these longer maintenance
restarts too.

> Umbrella jira for improved HDFS rolling upgrades
> ------------------------------------------------
>                 Key: HDFS-5535
>                 URL: https://issues.apache.org/jira/browse/HDFS-5535
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, ha, hdfs-client, namenode
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Nathan Roberts
>         Attachments: HDFSRollingUpgradesHighLevelDesign.pdf, h5535_20140219.patch, h5535_20140220-1554.patch, h5535_20140220b.patch, h5535_20140221-2031.patch
>
> In order to roll a new HDFS release through a large cluster quickly and safely, a few enhancements are needed in HDFS. An initial high-level design document will be attached to this jira, and sub-jiras will itemize the individual tasks.
