hadoop-hdfs-issues mailing list archives

From "Sanjay Radia (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1064) NN Availability - umbrella Jira
Date Wed, 24 Mar 2010 00:54:27 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848988#action_12848988 ]

Sanjay Radia commented on HDFS-1064:
------------------------------------

Some of the comments in the other Jiras have suggested that Yahoo has been working only on
scalability and not on availability. Both availability and scalability are important issues
for us. Most folks equate availability with automatic failover, but there is more to availability
than failover. HDFS was originally built to support a batch processing system. This
allowed one to rely on restart, since batch jobs can be delayed. However, the SLA requirements
for batch jobs are getting tighter. Further, Hadoop is beginning to be used for near-online
or online services.

Below is some of the work that has gone into improving the availability of the NN and
moving towards automatic failover. (Some of these are in release 20 and others are in trunk.)


* We have made a lot of progress in restarting an HDFS cluster. Two years ago, the restart
time for a 2K cluster at Yahoo was several hours; one had to start 100 DNs at a time whenever
the NN was rebooted. In trunk we have measured the restart time for a 3K cluster to be 30
minutes. Reducing restart time is important for failover: cold/warm failover performs all
or part of the restart. Some of the steps we took:
** Reducing the time to load the fsImage and edit logs; you will see more of this in the next few months.
** Reducing the cost of a block report - the initial block report is needed for the NN to start
providing service. Also, we can now safely restart the NN and deal with 3K initial block reports
in our clusters.
Facebook's internal patch puts block reports and heartbeats on a separate port - I understand
that this has helped the startup time.
* A major step towards HA was adding the backup namenode, which receives the edit logs
synchronously. This work needs to be extended to do an actual failover. We are exploring manual failover
using this backup NN and, later, automatic failover using ZooKeeper. There is also
ongoing work on integrating BookKeeper with the NN. (I will explain the tradeoffs of the
backup NN vs. BookKeeper in a future comment.)



> NN Availability - umbrella Jira
> -------------------------------
>
>                 Key: HDFS-1064
>                 URL: https://issues.apache.org/jira/browse/HDFS-1064
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Sanjay Radia
>
> This is an umbrella jira for discussing availability of the HDFS NN and providing references
> to other Jiras that improve its availability. This includes, but is not limited to, automatic
> failover.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

