hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Milind Bhandarkar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-227) Namespace check pointing is not performed until the namenode restarts.
Date Sun, 26 Nov 2006 18:25:56 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-227?page=comments#action_12453449 ] 
Milind Bhandarkar commented on HADOOP-227:

Proposal for Checkpointing the Namesystem State in DFS

Currently, the namesystem state in memory consists of a tree
where the leaf nodes are files and internal nodes are directories.
This is serialized onto a disk file once when the namenode starts
up. During the namenode operation, the state changes are made in
memory, but the on-disk copy of the image is not modified. Instead,
a transaction log (called edits log) for the namesystem changes
is stored on the disk. When the namenode starts up again, it merges
the on-disk image and the edits log, and writes out the updated image.

While the namenode is in operation, the edits log keeps on growing,
irrespective of the size of the namesystem. For example, if a single file
is constantly created and deleted, the namesystem state will be of
constant size, however, the edits log will keep on growing.

Therefore it is necessary to periodically checkpoint the current state
of the namesystem, and to purge the edits log.

Image File Format on Disk

The image on the disk consists of a header followed by a list of
path entries, followed by known datanode entries. The header consists
of DFS version number, namespace ID, and number of path entries.
Each path entry corresponds to either a file or a directory. A file entry
contains the full path of the file, it's replication factor, number of
blocks, and a list of block-entries for blocks belonging to that file.
Each block entry consists of blockID and length of the block.
A directory entry consists of full path of the directory, replication
factor (which, for a directory is always 0), followed by 0. This last
0 distinguishes between files and directories.
The datanodes section of the image begins with number of known datanodes
from the datanode map that the FSNameSystem maintains, followed by
serialized form of each DatanodeDescriptor in that map.

Edits Log File Format on Disk

The edits log contains a list of namespace transactions. Each transaction
begins with a transaction op-code that signifies type of the transaction.
There are seven types of transactions, and each type contains different
properties associated with it. These types of transactions and their
associated properties are listed

1. Add File : Full path of the file, replication factor, list of blocks
2. Rename : Old path of the file, new path of the file
3. Delete : Path of the file deleted
4. Make Directory : Path of the created directory
5. Set Replication : Path of the file, new replication factor
6. Add Datanode : datanode descriptor of the node
7. Remove Datanode : datanode ID of the node

When should we Checkpoint ?

Checkpointing decision could be based on elapsed time (e.g. every hour)
or based on number of transactions (e.g. every 100,000 changes to the
namesystem). Since the later is approximately reflected in the size of
the edits log, this decision could also be based on the size of the
edits log. This is preferred since, the cpu and/or memory requirements
of checkpointing are determined by the size of the image as well as
size of the edits log. Also, this choice ensures that an idle
namenode is not checkpointed unnecessarily based on elapsed time.

How should we Checkpoint ?

There are a number of choices here. We describe each choice, and its pros
and cons below.

1. Lock the entire namespace in the main namenode thread, while we save
the entire image on disk.

This would disable all namenode operations while we are checkpointing
the image. That would include processing of heartbeats also, and would
cause datanodes to consider namenode to be dead, and cause cascading
DFS crash.

2. While saving an INode (i.e. path), only lock those nodes in the tree
from root to that node.

This would require extensive changes in the simple locking model used
in the namenode. Currently all the operations that the namenode
performs are fairly inexpensive. Therefore the simple locking model
that locks the entire namespace durung the transaction suffices. With
the new fine-grained locking model, one would need to acquire multiple
locks for each namspace operation, thus incurring additional overhead
for normal operations, which is the common case.

3. Lock the entire namespace while we make an in-memory copy of the
namespace. Hand the copy over to a checkpointing thread, unlock the

This would certainly be faster that option 1, since it does not involve
writing the namespace to disk while it is locked. However, it would
require double the amount of virtual memory to hold the namespace
while checkpointing is in progress.

4. Lock the namespace. Rename the edits log. Start a new edits log.
Unlock the namespace. Fork off a separate process, which loads old image
and old edits log, and saves new image.

This method suffers from the same problem of requiring double the virtual
memory as does option 3. In addition, the forked process doubles the
system resources such as open files and sockets.

5. Lock the namespace. Rename the edits log. Start a new edits log.
Unlock the namespace. Start a new thread, which merges old image on disk
with old edits log on disk to create a new image.

One observation that makes this proposal attractive is that the current
in-memory image of the namesystem can be recreated by merging the old
image on the disk with the current edits log. On the face of it, this
method would also suffer from extensive virtual memory requirements,
having to load an entire disk image into memory. However, upon closer
inspection, merging an on-disk image with on-disk edits log can be
achieved with very small memory footprint, if we change the way the image
is stored on disk. These changes and their rationale is explained below.

Each entry in the on-disk image corresponds to a path. The only
requirement on the order of these entries for successfully loading
the image is that the entries corresponding to directories be before
the entries for files within those directories.

If we store the image where entries are sorted by their 'path' field,
clearly entries for directories would be earlier than entries for files
within them. With the sorted image, checkpointing process would involve
in-memory sorting of edit log on the 'path' field, and then merging
the path-related ttransaction one path at a time before writing the
final path record on the new disk image.

Almost all the path-related transactions in the edits log correspond to
a single path. The only exception to this is the 'rename' transaction,
which corresponds to two paths, one old and one new. For files, this
transaction could be split into a pair of transactions that corresponds
to one path each. 'rename old-path new-path' can be split into
'delete old-path' and 'create new-path' for the sake of transaction
logging. Even for directories that are empty, one can do a similar
split. However, for directories that contains files and/or subdirectories
it becomes complicated, because each file/subdirectory under the
renamed directory needs to have a pair of log-entries corresponding
to a 'delete' and 'create'. This will increase the edits log size
by a factor proportional to the number of files in the renamed directory.

One approach to handle the directory rename operation while periodically
checkpointing, is to apply the rename operations both on the on-disk
image, and on the edits log entries previous to the rename operation.
After renaming, though, the image will need to be sorted again according
to the path names. This could be very expensive, since the on-disk image
of a large filesystem (>1PB) could be a few GB.

The edits log would typically be a few Megabytes at the time of
periodic checkpointing, typically much smaller than the image. We take
advantage of this size difference, and an observation about the directory
rename operation, to propose a solution to this problem.

A rename operation "rename srcPath dstPath" in the edits log can be
moved to the end of the edits log, by applying other rename operations
on the edits log entries timestamped after the rename entry. That is,

rename srcPath dstPath

operation can be removed, and 

rename srcPath dstPath
rename tmpPath srcPath

can be appended at the end of the edits log, if we apply the following
rename operations to all edits log entries that occur after the rename
directory operations:

rename srcPath tmpPath
rename dstPath srcPath

This way, we can manipulate the edits log, so that all the rename
directory transactions are moved to the end. Then, we remove these
rename operations into a separate file called the rename-table.

We change the definition of the on-disk image, so that it consists not
only of the sorted list of all paths in the namesystem, but also this
rename-table. When the namenode starts up, it loads the list of paths
into memory to form a filesystem tree, and the applies the rename
operations from the rename-table, to get the final image.

With this modification for handling directory-renames, our periodic
checkpointing algorithm becomes:

Id = On disk image
Rd = Rename table on the disk
Ed = Edits log on the disk

1. Load Ed into memory as a list of edits entries
2. Scan from the end of the edits log in order to find a directory rename
   By applying the transformation described above, move these renames
   towards the end. Do this for all directory renames.
3. Append these rename operation to Rd.
4. Sort the remaining edits log in memory.
5. Merge Id (which is sorted), and sorted Ed, to get a new Id.

The namenode startup procedure is also modified, to be:

1. Load Id, form a root directory tree.
2. For each entry in Rd, apply rename operations on the image.
3. Merge edits log, if exists.
4. Store back the image as a sorted list of paths.
5. Delete renames-table, and edits log.

> Namespace check pointing is not performed until the namenode restarts.
> ----------------------------------------------------------------------
>                 Key: HADOOP-227
>                 URL: http://issues.apache.org/jira/browse/HADOOP-227
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.2.0
>            Reporter: Konstantin Shvachko
>         Assigned To: Konstantin Shvachko
> In current implementation when the name node starts, it reads its image file, then
> the edits file, and then saves the updated image back into the image file.
> The image file is never updated after that.
> In order to provide the system reliability reliability the namespace information should
> be check pointed periodically, and the edits file should be kept relatively small.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message