hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-955) FSImage.saveFSImage can lose edits
Date Thu, 11 Feb 2010 22:32:30 GMT

    [ https://issues.apache.org/jira/browse/HDFS-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832706#action_12832706 ]

Todd Lipcon commented on HDFS-955:
----------------------------------

I worked through this a bit last night. Here are some options for a solution.

h3. 1. Add "undo log" file to storage directory

In this solution, we add a new file called "undolog" in each storage directory. Whenever we're
in the midst of a transition, we write some bit of data in this file that explains what the
proper rollback procedure is. Thus, for the checkpoint from the checkpoint node, we'd write
a file that says "if IMAGE_NEW is complete, use IMAGE_NEW + EDITS_NEW. Otherwise use IMAGE
+ EDITS + EDITS_NEW". For the saveNamespace operation, we'd write "If IMAGE_NEW is complete,
use IMAGE_NEW. Otherwise use IMAGE + EDITS + EDITS_NEW".
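
To make the two rules above concrete, here is a minimal sketch of how the contents of the undo log could be modeled and interpreted at startup. The names UndoRule, CHECKPOINT_RULE, SAVE_NAMESPACE_RULE and files_to_load are hypothetical and exist only for this illustration; nothing here is existing NameNode code, and the on-disk format of the "undolog" file is left open.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class UndoRule:
    """Hypothetical in-memory form of one undo log entry."""
    if_image_new_complete: Tuple[str, ...]  # files to load when IMAGE_NEW is complete
    otherwise: Tuple[str, ...]              # files to load when it is not


# Rule written before accepting a checkpoint from the checkpoint node.
CHECKPOINT_RULE = UndoRule(
    if_image_new_complete=("IMAGE_NEW", "EDITS_NEW"),
    otherwise=("IMAGE", "EDITS", "EDITS_NEW"),
)

# Rule written before a saveNamespace operation.
SAVE_NAMESPACE_RULE = UndoRule(
    if_image_new_complete=("IMAGE_NEW",),
    otherwise=("IMAGE", "EDITS", "EDITS_NEW"),
)


def files_to_load(rule: UndoRule, image_new_complete: bool) -> Tuple[str, ...]:
    """Pick the recovery file set that the undo log entry prescribes."""
    return rule.if_image_new_complete if image_new_complete else rule.otherwise
```

With this model, a crash in the middle of saveNamespace with a complete IMAGE_NEW recovers from IMAGE_NEW alone, while the same crash during a checkpoint upload also replays EDITS_NEW.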

This has the advantage of making the recovery choices explicit during all state transitions
- we're forced to think carefully after each step of the operation in order to maintain the
undo instructions.

On the downside, it's more complexity.

h3. 2. Don't allow -saveNamespace when the logs are in ROLLED state

I don't like this one at all, but it would allow us to always use the IMAGE_NEW + EDITS_NEW
recovery.

h3. 3. Redesign rolling to not reuse filenames

This is a much bigger change, but I think it would also help simplify a lot of the code. The
proposal here is to manage edit logs in a way that's similar to what MySQL does. Specifically,
instead of IMAGE and IMAGE_NEW plus EDITS and EDITS_NEW, we simply have a monotonically increasing
identifier on each log file. So, the state of the system starts with image_0 and edits_0.
Logs may be rolled at any point, which increments the N in edits_N. So in normal operation we'd see:

image_0
edits_0 <- writing here

[roll edits]
image_0
edits_0
edits_1 <- writing here

[checkpoint node fetches image_0 and edits_0, and uploads image_1]

image_0 <- this is now "stale" and can be garbage collected later
image_1 <- this contains image_0 + edits_0
edits_0 <- this is also stale
edits_1 <- still being written
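
The sequence above can be sketched as a toy model of a single storage directory. The StorageDir class and its method names are hypothetical, and the sketch tracks only file indices, not file contents. It assumes the naming rule from this proposal: an image built from edits up to edits_(M-1) is called image_M.

```python
class StorageDir:
    """Toy model of one storage directory under the numbered-file scheme."""

    def __init__(self):
        self.images = {0}  # indices N for which image_N exists
        self.edits = {0}   # indices N for which edits_N exists
        self.current = 0   # edits_<current> is the log being written

    def roll_edits(self):
        """Close the current log and start writing the next one."""
        self.current += 1
        self.edits.add(self.current)
        return self.current

    def upload_checkpoint(self, base_image, last_edits):
        """Record an image built from image_<base_image> plus the logs
        edits_<base_image> through edits_<last_edits>; return its index."""
        if base_image not in self.images:
            raise ValueError("unknown base image")
        if last_edits < base_image:
            raise ValueError("checkpoint must cover at least one edit log")
        if last_edits >= self.current:
            raise ValueError("cannot checkpoint the log still being written")
        new_index = last_edits + 1
        self.images.add(new_index)
        return new_index

    def stale(self):
        """Return (image indices, edits indices) older than the newest image.
        These are the files that could be garbage collected later."""
        newest = max(self.images)
        return (sorted(i for i in self.images if i < newest),
                sorted(e for e in self.edits if e < newest))
```

Starting from a fresh StorageDir, calling roll_edits() and then upload_checkpoint(0, 0) reproduces the final listing above: image_0 and edits_0 are reported as stale, while image_1 and edits_1 remain live.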

This design has many plusses in my view:
# Files never change names, and thus race conditions like HDFS-909 are less likely, so long
as the current number is synchronized.
# Recovery is much simpler - you can always recover from image_n + edits_n through edits_max,
so long as image_n is complete. Any incomplete or corrupt images can always be safely ignored
so long as there is an earlier one, plus all the edit logs going back to that point.
# The fstime checking logic is simplified - an image made from image_N plus edits_N through
edits_(M-1) is always going to be called image_M. Any image_M from any storage directory should
be identical regardless of any ongoing rolls.
# Edit logs and images can both be kept for some time window, simplifying backup and recovery
a bit while also providing an easy mechanism for point-in-time recovery of the namespace.
Although PITR is less than useful if data blocks are gone, this mechanism would make it impossible
for a bug like HDFS-909 or HDFS-955 to lose edits, since files are never truncated or removed
until after they're "stale".
# We no longer have to be careful about the NN's "rolled" vs "upload_done" vs "start" state
- the logs are looked at as constantly rolling, and it's always clear where to apply a checkpoint
image.
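
The recovery rule from the second point in the list above could look roughly like the following. This is a sketch under stated assumptions, not a definitive implementation: recovery_plan is a hypothetical helper, and it assumes the caller has already determined which images are complete (how completeness is checked is outside the sketch).

```python
def recovery_plan(complete_images, edit_logs):
    """Choose what to load at startup.

    complete_images: set of N for which image_N exists and is complete.
    edit_logs:       set of N for which edits_N exists.
    Returns (n, logs): load image_n, then replay edits_<k> for each k in logs.
    """
    if not complete_images:
        raise RuntimeError("no complete image to recover from")

    # An older image would need a superset of the logs the newest one needs,
    # so the newest complete image is the only candidate worth checking.
    n = max(complete_images)

    # If no log with index >= n exists, there is nothing to replay.
    last = max(edit_logs, default=n - 1)
    needed = list(range(n, last + 1))

    missing = [k for k in needed if k not in edit_logs]
    if missing:
        raise RuntimeError("missing edit logs: %s" % missing)
    return n, needed
```

An incomplete or corrupt image_1 is simply left out of complete_images, in which case the plan falls back to image_0 plus edits_0 and edits_1.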

The downside, of course, is that it's a very big change, definitely not a candidate for backport,
and could take a while.

h3. 4. Distinguish IMAGE_NEW_CKPT vs IMAGE_NEW_SAVED

Rather than having a single IMAGE_NEW filename like we do now, we could split it into IMAGE_NEW_CKPT
and IMAGE_NEW_SAVED. The recovery mechanism for these would differ in that, if there is a
completed IMAGE_NEW_CKPT, then it will recover IMAGE_NEW_CKPT + EDITS_NEW. If there is a completed
IMAGE_NEW_SAVED, then it can truncate both EDITS and EDITS_NEW during recovery, since a saved
namespace encompasses both.
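
A minimal sketch of that recovery decision follows. The recover function and its return shape are hypothetical. Two behaviors are assumptions that the description above does not specify: when neither new image is complete, the sketch falls back to IMAGE + EDITS + EDITS_NEW, and when both are complete it refuses to guess.

```python
def recover(completed):
    """Decide recovery actions from the set of completed file names.

    Returns a dict with the files to load (in order) and the edit logs
    that may be truncated afterwards.
    """
    has_ckpt = "IMAGE_NEW_CKPT" in completed
    has_saved = "IMAGE_NEW_SAVED" in completed

    if has_ckpt and has_saved:
        # Assumption: this combination is not defined, so do not guess.
        raise ValueError("both IMAGE_NEW_CKPT and IMAGE_NEW_SAVED are complete")

    if has_saved:
        # A saved namespace already encompasses EDITS and EDITS_NEW.
        return {"load": ["IMAGE_NEW_SAVED"], "truncate": ["EDITS", "EDITS_NEW"]}

    if has_ckpt:
        # A checkpoint image still needs EDITS_NEW replayed on top of it.
        return {"load": ["IMAGE_NEW_CKPT", "EDITS_NEW"], "truncate": []}

    # Assumption: with no completed new image, use the old image and all edits.
    return {"load": ["IMAGE", "EDITS", "EDITS_NEW"], "truncate": []}
```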


Unfortunately, not one of these is a simple fix. If you have any proposals that are both simple
and correct, I'd be very interested to hear them.

One thing I'd also like to consider more is the interaction of these processes with filesystem
journaling. I'm not sure if ext3's data=ordered
journaling mode (probably the most common deployment configuration) guarantees quite enough
ordering between different files that all of the above will work correctly in the event of
host failures. I need to learn more about that and report back.

> FSImage.saveFSImage can lose edits
> ----------------------------------
>
>                 Key: HDFS-955
>                 URL: https://issues.apache.org/jira/browse/HDFS-955
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20.1, 0.21.0, 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>         Attachments: hdfs-955-unittest.txt, PurgeEditsBeforeImageSave.patch
>
>
> This is a continuation of a discussion from HDFS-909. The FSImage.saveFSImage function
> (implementing dfsadmin -saveNamespace) can corrupt the NN storage such that all current edits
> are lost.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

