hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "alan wootton (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-90) DFS is succeptible to data loss in case of name node failure
Date Tue, 30 May 2006 19:10:31 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-90?page=comments#action_12413885 ] 

alan wootton commented on HADOOP-90:

What we plan is to securely backup the 'snapshot' (the image file) after the NameNode is started.
So that takes care of that part.  Now it's all about saving the edit log in a safe way. I
don't like the idea of the backup being even one minute old.

The failure mode is the loss of a hard drive on the NameNode server. Simply writing, ALL the
time, the edit log to more than one path, and therefore to more than one drive, goes a long
way towards making the data secure. If the node dies you would still have to retreive one
of the edit files from one of the drives, but at least one of the drives should still work
(there are 3 on our namenode). Someone needs to pull the drives and mount them on another
machine before recovery can happen, but hey, it's going to be a rare event.

If you are using Solaris, as we are, then sun nfs is available (or so they tell me). We mount
an nfs drive, and write the edit log to that drive also. In this case recovery can happen
by copying the image, and the edits, from nfs to another node, changing DNS for the name of
the namenode, and starting a new namenode. 

I feel, at least for us, that this IS a real solution, and not just a band-aid. 

> DFS is succeptible to data loss in case of name node failure
> ------------------------------------------------------------
>          Key: HADOOP-90
>          URL: http://issues.apache.org/jira/browse/HADOOP-90
>      Project: Hadoop
>         Type: Bug

>   Components: dfs
>     Versions: 0.1.0
>     Reporter: Yoram Arnon
>  Attachments: multipleEditsDest.patch
> Currently, DFS name node stores its log and state in local files.
> This has the disadvantage that a hardware failure of the name node causes a total data
> Several approaches may be used to address this flaw:
> 1. replicate the name server state files using copy or rsync once in a while, either
manually or using a cron job.
> 2. set up secondary name servers and a protocol whereby the primary updates the secondaries.
In case of failure, a secondary can take over.
> 3. store the state files as distributed, replicated files in the DFS itself. The difficulty
is that it becomes a bootstrap problem, where the name node needs some information, typically
stored in its state files, in order to read those same state files.
> solution 1 is fine for non critical systems, but for systems that need to guarantee no
data loss it's insufficient.
> Solutions 2 and 3 both seem valid; 3 seems more elegant in that it doesn't require an
extra protocol, it leverages the DFS and allows any level of replication for robustness. Below
is a proposition for  solution 3.
> 1.	The name node, when it starts up, needs some basic information. That information is
not large and can easily be stored in a single block of DFS. We hard code the block location,
using block id 0. Block 0 will contain the list of blocks that contain the name node metadata
- not the metadata itself (file names, servers, blocks etc), just the list of blocks that
contain it. With a block identified by 8 bytes, and 32 MB blocks, we can fit 256K block id's
in block 0. 256K blocks of 32MB each can hold 8TB of metadata, which can map a large enough
file system, so a single block of block_ids is sufficient.
> 2.	The name node writes his state basically the same way as now: log file plus occasional
full state. DFS needs to change to commit changes to open files while allowing continued writing
to them, or else the log file wouldn't be valid on name server failure, before the file is
> 3.	The name node will use double buffering for its state, using blocks 0 and 1. Starting
with block 0, it writes its state, then a log of changes. When it's time to write a new state
it writes it to node 1. The state includes a generation number, a single byte starting at
0, to enable the name server to identify the valid state. A CRC is written at the end of the
block to mark its validity and completeness. The log file is identified by the same generation
number as the state it relates to. 
> 4.	The log file will be limited to a single block as well. When that block fills up a
new state is written. 32MB of transaction logs should suffice. If not, we could set aside
a set of blocks, and set aside a few locations in the super-block (block 0/1) to store that
set of block ids.
> 5.	The super-block, the log and the metadata blocks may be exposed as read only files
in reserved files in the DFS: /.metadata/* or something.
> 6.	When a name nodes starts, it waits for data nodes to connect to it to report their
blocks. It waits until it gets a report about blocks 0 and 1, from which it can continue to
read its entire state. After that it continues normally.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators:
For more information on JIRA, see:

View raw message