hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1073) Simpler model for Namenode's fs Image and edit Logs
Date Thu, 08 Apr 2010 19:09:36 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855096#action_12855096

Todd Lipcon commented on HDFS-1073:

bq. The serial numbering of the files solution requires that checkpoints occur only at a edits
split boundaries.

Yes, but since we can split edits at will, I don't think there's any problem just having the
backupnode asking the active NN to roll whenver the BN would like to do a checkpoint. The
nice thing about this is that an image file from the BN can be lined up exactly with the corresponding
edit logs from the NN, etc.

bq. The transaction ID one does not have that restriction but it does require that in order
to detect a gap in edits one has to look inside the logs. The txId one can avoid that if we
are prepared to rename the edits log when you split (roll) it (Ugh!)

Agreed re ugh! The renaming is the complexity we're trying to avoid, no?

bq. The txId numbering scheme also has the advantage that multiple backups can roll and do
checkpoints independently (we DONOT want to do that as it will confuse the operators –
but it shows that the design is very robust.

I still think this is possible with sequential numbering. And I agree that not confusing operators
is a key design goal for this JIRA. The whole image/edit log thing in normal operation should
be an implementation detail, and when operators have to look at it they're usually very stressed
out because a cluster is corrupt - so we want to make it _very_ clear what's going on, and
_very_ hard to create any state that is unrecoverable.


I've started working on this patch and it's coming along nicely. The NN and secondary NN are
working great, and just started on the BN/Checkpointer. Here's a brief overview of the design
I'm going with - hopefully I will answer the above questions along the way.

h2. Storage contents

The NN storage directories continue to be organized in the same way - either edits, images,
or both. The difference is that each edits or fsimage file now has a suffix indicating its
"roll index". For example, a newly formatted NN has the following contents:

- fsimage_0 - empty image
- edits_0_inprogress - the edit log currently being appended

When edits are rolled, the current 'edits_N_inprogress' file is "finalized" by renaming to
simply edits_N. So, if we roll the edits of the above image, we end up with:

- fsimage_0 - same empty image
- edits_0 - any edits made before the roll
- edits_1_inprogress

When an image is saved or uploaded via a checkpoint, the validity rule is as follows: any
fsimage with roll index N must incorporate all edits from logs with a roll index less than
N. So, if we enter safe mode and call saveNamespace on the above example, we end up with:

- fsimage_0 - original empty imagge
- edits_0 - edits before first roll
- edits_1 - edits before saveNamespace
- fsimage_2 - all edits from edits_0 and edits_1
- edits_2_inprogress - the edit log where new edits will be appended

h2. Log Rolling Triggers

The following events can trigger a log roll:
- NN startup (see below)
- saveNamespace
- a secondary or backup node wants to begin a checkpoint
- an IOException has occurred on one of the current edit logs
- potentially we may find it useful to expose this as an admin function? (eg mysql offers
a flush logs; command)

h2. Log rolling behavior:

- The current edits_N_inprogress log is closed
- The current edits_N_inprogress log is renamed to edits_N in all valid edits directories.
- Any edits directories that previously had problems will be left with edits_N_inprogress
(since we don't know whether all of the edits made it into that log before the roll, in fact
they probably did not)
- The next edits_N+1_inprogress is opened in all directories, including an attempt to reopen
any failed directories.

h2. Startup behavior

First we initiate log recovery:

- Across all edits directories, look for any edits_N_inprogress:
-- If one is found, look for a finalized edits_N file in any other log directory
--- If there is at least one finalized edits_N, then the edits_N_inprogress is likely corrupt
-- rename it to edits_N_corrupt (or delete it if we are less cautious)
-- If there are no finalized edits_N files, then the NN crashed while we were writing log
index N. Initiate recovery process across all edits_N_inprogress:
--- Currently this isn't fancy - I just pick one. However, we could scan each of the logs
for OP_INVALID and find the longest one, ensure that they have the same length, etc (eg one
log must not have caught the last edit, or been truncated, etc)
--- This is very simple to do since across all directories (including secondaries) edits_M
for any M should be identical!
--- After we've determined the correct log(s), finalize it and remove the others

Next, find the fsimage_N with the highest N across all image directories.
Then, find the edits_M with the highest M across all edits directories.

For safety, we check that there exists an edits_X for all X between N and M inclusive.

We then start up the NN by the following sequence:
- load fsimage_N
- for each M through N inclusive, load edits_N
- if we loaded any edits, save fsimage_N+1
- open edits_inprogress_N+1

h2. Checkpoint process

- Checkpoint Signature is modified to include the latest image index and the current log index
in progress.
- Checkpointing node issues beginCheckpoint to NN
- NN rolls edit logs, and returns a checkpoint signature that includes the latest stored fsimage_N,
as well as the index of the log it just rolled to
- Image transfer servlet is augmented to allow the downloader to specify which image or edits
file to download
- Checkpointer downloads fsimage_N and edits_N through edits_M (where M is the new finalized
edit log from the roll)
- Checkpointer saves local fsimage_M+1, and uploads to NN
- NN validation of the checkpoint signature is much simpler - just needs to make sure it came
from the same filesystem, check any security tokens, etc. The old fstime and editstime constructs
are no longer necessary since it's all encapsulated in the index numbers. For extra safety
we can easily add some checksum or log length info to the CheckpointSignature
- NN saves fsimage_M+1 into its local image dirs, but does not need to do any log manipulation.

I'm still working out the backupnode operation, but I think it will actually be simplified
by this proposal. Rather than having a special journaling mode, I think the NN can simply
push any log roll events through the edit log stream to the BN. This will keep the roll indexes
(and log contents) on the BN exactly identical to the indexes on the NN, which has good operational
advantages and also reduces code complexity in the BN.

h2. Handling multiple checkpointers

Note that in the above process there is no state stored on the NN with regard to ongoing checkpoint
processes. If multiple checkpoint nodes checkpoint simultaneously, the NN will simply roll
twice and hand a different index to each. Each will then upload fsimages with different indexes.

h2. Image/edits file retention policies

There are a number of policies that should be simple to implement:

- *Number of saved images* - ensure that we have at least N saved images in our image directories,
can delete any that are more than N versions old. Maintain edit lots that have index >=
the index of the Nth oldest image.
- *Time* - ensure that we maintain all images within a trailing time window - again maintain
all edit logs with index >= index of oldest maintained image.
- *Archival* - for audit purposes, the deletion mechanism could very easily be augmented to
archive the edit logs for later analysis (eg to HDFS, tape, SAN, etc)

So long as any fsimage_N and all edits_M where M >= N are retained somewhere, they can
be copied back into the NN's storage directories and full PITR is possible.

> Simpler model for Namenode's fs Image and edit Logs 
> ----------------------------------------------------
>                 Key: HDFS-1073
>                 URL: https://issues.apache.org/jira/browse/HDFS-1073
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Sanjay Radia
>            Assignee: Todd Lipcon
> The naming and handling of  NN's fsImage and edit logs can be significantly improved
resulting simpler and more robust code.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message