hadoop-common-dev mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-306) Safe mode and name node startup procedures
Date Fri, 11 Aug 2006 21:56:14 GMT
     [ http://issues.apache.org/jira/browse/HADOOP-306?page=all ]

Konstantin Shvachko updated HADOOP-306:

    Attachment: FSImageSaveDNInfo.patch

This patch implements part of the design laid out below.
It stores the historical list of datanodes in the image file and logs newly
registered nodes in the edits file.
The changes are mostly related to the FSNamesystem class.
Since datanodes are not removed from the datanodeMap when they are considered
non responsive, the deadDatanodeMap field becomes redundant. I removed it.
The semantics of the relation between the heartbeats and datanodeMap maps have changed:
heartbeats contains only live nodes, while datanodeMap contains both live and dead nodes.
See the JavaDoc for more details.
So when we look for new targets for block replication, we should check the heartbeats
map rather than the datanodeMap, as we did before.
Also, since the DatanodeDescriptors are not physically removed from datanodeMap,
I had to add cleanup of their blocks while processing lost heartbeats.
There are some changes to the FSImage and FSEditLog classes. I removed the unnecessary
parameter to FSDirectory from the previous version. The whole name space can be
accessed via the static method FSNamesystem.getFSNamesystem().
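
A minimal sketch of the heartbeats / datanodeMap relation described above, in Java.
The class NodeRegistrySketch and all of its methods are hypothetical and are not taken
from the patch; only the two map names follow this message. It assumes storage ids are
plain strings and blocks are plain longs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the name node's bookkeeping; not the patch's code.
final class NodeRegistrySketch {
    // Every storage ever registered, alive or dead (plays the role of datanodeMap).
    private final Map<String, List<Long>> datanodeMap = new TreeMap<>();
    // Only nodes with a recent heartbeat (plays the role of heartbeats),
    // mapped to the time of their last heartbeat in milliseconds.
    private final Map<String, Long> heartbeats = new TreeMap<>();

    void register(String storageId) {
        datanodeMap.putIfAbsent(storageId, new ArrayList<>());
    }

    void heartbeat(String storageId, long nowMillis) {
        register(storageId);
        heartbeats.put(storageId, nowMillis);
    }

    void addBlock(String storageId, long blockId) {
        register(storageId);
        datanodeMap.get(storageId).add(blockId);
    }

    // Lost heartbeat: the node leaves the live map but stays in datanodeMap,
    // so its block list has to be cleaned up explicitly.
    void expire(long nowMillis, long expiryMillis) {
        heartbeats.entrySet().removeIf(e -> {
            if (nowMillis - e.getValue() > expiryMillis) {
                datanodeMap.get(e.getKey()).clear();
                return true;
            }
            return false;
        });
    }

    // Replication targets come from the live map only.
    List<String> replicationTargets() {
        return new ArrayList<>(heartbeats.keySet());
    }

    boolean isKnown(String storageId) {
        return datanodeMap.containsKey(storageId);
    }

    int blockCount(String storageId) {
        List<Long> blocks = datanodeMap.get(storageId);
        return blocks == null ? 0 : blocks.size();
    }
}
```

After expire() a dead node is still known to the registry, which is what allows the
historical list to be saved, but it is never offered as a replication target.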

> Safe mode and name node startup procedures
> ------------------------------------------
>                 Key: HADOOP-306
>                 URL: http://issues.apache.org/jira/browse/HADOOP-306
>             Project: Hadoop
>          Issue Type: New Feature
>    Affects Versions: 0.3.2
>            Reporter: Konstantin Shvachko
>         Assigned To: Konstantin Shvachko
>             Fix For: 0.6.0
>         Attachments: FSImageSaveDNInfo.patch
> This is a proposal to improve DFS cluster startup process.
> The data node startup procedures were described and implemented in HADOOP-124.
> I'm trying to extend them to the name node here.
> The main idea is to introduce safe mode, which can be entered manually for administration
> purposes, or automatically when a configurable threshold of active data nodes is breached,
> or at startup when the node stays in safe mode until the minimal limit of active
> nodes is reached.
> These are high-level requirements intended to improve the name node and cluster reliability.
>     = The name node safe mode means that the name node is not changing the state of the
>        file system. Meta data is read-only, and block replication / removal is not taking place.
>     = In safe mode the name node accepts data node registrations and
>        processes their block reports.
>     = The name node always starts in safe mode and stays safe until the majority
>         (a configurable parameter: safemode.threshold) of data nodes (or blocks?)
>         is reported.
>     = The name node can also fall into safe mode when the number of non-active
>         (heartbeats stopped coming in) data nodes becomes critical.
>     = The startup "silent period", when the name node is in safe mode and is
>         not issuing any block requests to the data nodes, is initially set to a
>         configurable value safemode.timeout.increment. By the end of the timeout
>         the name node checks the safemode.threshold and decides whether to switch
>         to the normal mode or to stay in safe mode. If the normal mode criteria are not
>         met, then the silent period is extended by incrementing the safemode timeout.
>     = The name node stays in safe mode no longer than a configurable value of
>         safemode.timeout.max, in which case it logs missing data nodes and shuts
>         itself down.
>     = When the name node switches to normal mode it checks whether all required
>         data nodes have actually registered, based on the list of active data storages
>         from the last session. Then it logs missing nodes, if any, and starts
>         replicating and/or deleting blocks as required.
>     = A historical list of data storages (nodes) ever registered with the cluster is
>         persistently stored in the image and log files. The list is used in two ways:
>         a) at startup to verify whether all nodes have registered, and to report
>         missing nodes;
>         b) at runtime if a data node registers with a new storage id the
>         name node verifies that no new blocks are reported from that storage,
>         which would prevent us from accidentally connecting data nodes from a
>         different cluster.
>     = The name node should have an option to run in safe mode. Starting with
>         that option would mean it never leaves safe mode.
>         This is useful for testing the cluster.
>     = Data nodes that cannot connect to the name node for a long time (configurable)
>         should shut themselves down.
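
The startup "silent period" rules above can be sketched in Java as follows. This is an
illustration under stated assumptions, not the proposed implementation: the class
SafeModeSketch, its Decision enum and its method names are hypothetical; the threshold is
assumed to be a fraction of expected nodes (the proposal leaves open whether nodes or
blocks are counted); and the name node is assumed to shut down when one more increment
would exceed safemode.timeout.max.

```java
// Hypothetical model of the safe mode startup check; names are not from Hadoop.
final class SafeModeSketch {
    enum Decision { LEAVE_SAFE_MODE, EXTEND_SILENT_PERIOD, SHUT_DOWN }

    private final double threshold;       // safemode.threshold, assumed to be a fraction in [0, 1]
    private final long timeoutIncrement;  // safemode.timeout.increment, in milliseconds
    private final long timeoutMax;        // safemode.timeout.max, in milliseconds
    private long timeout;                 // current length of the silent period

    SafeModeSketch(double threshold, long timeoutIncrement, long timeoutMax) {
        this.threshold = threshold;
        this.timeoutIncrement = timeoutIncrement;
        this.timeoutMax = timeoutMax;
        this.timeout = timeoutIncrement;  // the silent period starts at one increment
    }

    // Called when the current silent period ends.
    Decision checkAtTimeout(int reported, int expected) {
        double ratio = expected == 0 ? 1.0 : (double) reported / expected;
        if (ratio >= threshold) {
            return Decision.LEAVE_SAFE_MODE;
        }
        // Assumption: give up once another increment would pass the maximum.
        if (timeout + timeoutIncrement > timeoutMax) {
            return Decision.SHUT_DOWN;
        }
        timeout += timeoutIncrement;
        return Decision.EXTEND_SILENT_PERIOD;
    }

    long currentTimeout() {
        return timeout;
    }
}
```

A caller would wait for currentTimeout() milliseconds, call checkAtTimeout() with the
number of reported and expected data nodes, and on SHUT_DOWN log the missing nodes
before exiting.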
