hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5223) Allow edit log/fsimage format changes without changing layout version
Date Wed, 18 Sep 2013 18:53:53 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771100#comment-13771100 ]

Todd Lipcon commented on HDFS-5223:
-----------------------------------

To expand a little on Aaron's summary of our discussion above:

*Proposal 1*:
- Note that we already include a version number in the header of the edit log and image formats.
So, within a single image or edits directory, you might now have edit log segments
or images with different version numbers: the ones written post-upgrade would have a higher
version number.
- Note that this allows for in-place software upgrade, but not in-place software downgrade.
Once you've written an edit log with the new version, you can't downgrade the NN back to
the previous version, because it would refuse to read the higher-versioned edit log segment.
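
The upgrade-but-not-downgrade behavior above can be sketched roughly as follows. This is an illustration only, not NameNode code: the class `SegmentVersionCheck`, its methods, and the 4-byte header layout are hypothetical, and plain increasing integers stand in for whatever version numbering the real headers use.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of Proposal 1; not the actual HDFS reader.
class SegmentVersionCheck {
    // Highest header version this (hypothetical) software build understands.
    private final int supportedVersion;

    SegmentVersionCheck(int supportedVersion) {
        this.supportedVersion = supportedVersion;
    }

    // Builds a toy segment that consists only of a 4-byte version header.
    static byte[] writeHeader(int version) {
        return ByteBuffer.allocate(4).putInt(version).array();
    }

    // A build can read a segment only if the segment's header version
    // is not newer than the newest version the build knows about.
    boolean canRead(byte[] segment) {
        int version = ByteBuffer.wrap(segment).getInt();
        return version <= supportedVersion;
    }

    public static void main(String[] args) {
        SegmentVersionCheck oldNn = new SegmentVersionCheck(1);
        SegmentVersionCheck newNn = new SegmentVersionCheck(2);
        byte[] preUpgrade = writeHeader(1);
        byte[] postUpgrade = writeHeader(2);

        System.out.println("new NN reads old segment: " + newNn.canRead(preUpgrade));
        System.out.println("new NN reads new segment: " + newNn.canRead(postUpgrade));
        System.out.println("old NN reads new segment: " + oldNn.canRead(postUpgrade));
    }
}
```

The last check is the downgrade case: once a segment with the newer header exists in the directory, the older build has to reject it.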

bq. and we would require that changes made to the format of existing fsimage/edit log entries
be done in a backward compatible fashion

This isn't quite the case -- because the new edit log segments would have a new version number,
we have the same ability to evolve opcodes as today. I verified with Aaron that he mis-stated
this above.

*Proposal 2*:
- This is basically the way that file systems such as ext3 handle version compatibility. Every
ext3 filesystem's superblock contains a set of flags which determine which features have been
enabled for it. Similarly, we'd add something to the edit log and fsimage headers with a set
of feature names. Here are the docs from Documentation/filesystems/ext2.txt in the kernel tree:

{code}
These feature flags have specific meanings for the kernel as follows:

A COMPAT flag indicates that a feature is present in the filesystem,
but the on-disk format is 100% compatible with older on-disk formats, so
a kernel which didn't know anything about this feature could read/write
the filesystem without any chance of corrupting the filesystem (or even
making it inconsistent).  This is essentially just a flag which says
"this filesystem has a (hidden) feature" that the kernel or e2fsck may
want to be aware of (more on e2fsck and feature flags later).  The ext3
HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply
a regular file with data blocks in it so the kernel does not need to
take any special notice of it if it doesn't understand ext3 journaling.

An RO_COMPAT flag indicates that the on-disk format is 100% compatible
with older on-disk formats for reading (i.e. the feature does not change
the visible on-disk format).  However, an old kernel writing to such a
filesystem would/could corrupt the filesystem, so this is prevented. The
most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because
sparse groups allow file data blocks where superblock/group descriptor
backups used to live, and ext2_free_blocks() refuses to free these blocks,
which would leading to inconsistent bitmaps.  An old kernel would also
get an error if it tried to free a series of blocks which crossed a group
boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.

An INCOMPAT flag indicates the on-disk format has changed in some
way that makes it unreadable by older kernels, or would otherwise
cause a problem if an old kernel tried to mount it.  FILETYPE is an
INCOMPAT flag because older kernels would think a filename was longer
than 256 characters, which would lead to corrupt directory listings.
The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel
doesn't understand compression, you would just get garbage back from
read() instead of it automatically decompressing your data.  The ext3
RECOVER flag is needed to prevent a kernel which does not understand the
ext3 journal from mounting the filesystem without replaying the journal.
{code}
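
Translated to edit log and fsimage headers, the three flag classes could drive a check along these lines. This is a sketch under assumptions: `FeatureHeader`, `FeatureCheck`, and the feature names are made up for illustration, and nothing here reflects an existing HDFS API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical header carrying three sets of feature names, mirroring
// ext2/ext3's COMPAT, RO_COMPAT and INCOMPAT superblock flags.
class FeatureHeader {
    final Set<String> compat;
    final Set<String> roCompat;
    final Set<String> incompat;

    FeatureHeader(Set<String> compat, Set<String> roCompat, Set<String> incompat) {
        this.compat = compat;
        this.roCompat = roCompat;
        this.incompat = incompat;
    }

    // Small helper for building a set of feature names.
    static Set<String> names(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }
}

class FeatureCheck {
    enum Access { READ_WRITE, READ_ONLY, REFUSE }

    // Decides what a build that knows the given features may do with a file.
    static Access decide(Set<String> knownFeatures, FeatureHeader header) {
        // Any unknown INCOMPAT feature means the file cannot be read safely.
        if (!knownFeatures.containsAll(header.incompat)) {
            return Access.REFUSE;
        }
        // Unknown RO_COMPAT features are safe to read but not to write.
        if (!knownFeatures.containsAll(header.roCompat)) {
            return Access.READ_ONLY;
        }
        // Unknown COMPAT features are ignored entirely.
        return Access.READ_WRITE;
    }
}
```

For example, a header whose INCOMPAT set contains a hypothetical `"snapshots"` feature would be refused by a build that does not list `"snapshots"` among its known features, while a header that only carries unknown COMPAT names would still be fully usable.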

This would allow us to do rolling upgrades, run mixed-version clusters, and still retain the
ability to roll back to a prior version until the new feature was used. So, to take the example
of a feature like snapshots which required a metadata change, the admin workflow would be:

# Shut down the standby node
# Upgrade the standby's software version
# Start the standby node and fail over to it
# Shut down and upgrade the old active, then start it back up.
# Note: at this point, the format for the edit logs and images is identical to the pre-upgrade
format, so the user could still roll back. Trying to create a snapshot at this point would
fail with an error like "Snapshots not enabled for this filesystem. Run dfsadmin -enableFeature
snapshots to enable"
# User runs the above command, which forces an edit log roll. The new edit logs contain the
flag indicating that snapshots are enabled, and may use the new opcodes (or add new fields
to the old opcodes as necessary)
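
The last two steps amount to gating the new operation behind an explicit enable that also rolls the edit log. A rough sketch, assuming a hypothetical `FeatureGate` class; `-enableFeature` is the command proposed in this comment, not an existing dfsadmin option:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the "explicit enable" workflow; not NameNode code.
class FeatureGate {
    private final Set<String> enabledFeatures = new HashSet<>();
    private int editLogRolls = 0;

    // Models the proposed "dfsadmin -enableFeature <name>": enabling a
    // feature for the first time forces an edit log roll, so that the flag
    // first appears in the header of a fresh segment.
    void enableFeature(String name) {
        if (enabledFeatures.add(name)) {
            editLogRolls++;
        }
    }

    // Snapshot creation is rejected until the feature has been enabled.
    void createSnapshot(String path) {
        if (!enabledFeatures.contains("snapshots")) {
            throw new IllegalStateException(
                "Snapshots not enabled for this filesystem. "
                + "Run dfsadmin -enableFeature snapshots to enable");
        }
        // A real implementation would log the new opcode here.
    }

    int getEditLogRolls() {
        return editLogRolls;
    }

    boolean isEnabled(String name) {
        return enabledFeatures.contains(name);
    }
}
```

Until `enableFeature("snapshots")` is called, nothing new is written, which is what keeps rollback possible after the software upgrade.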

If the "explicit enable" doesn't sit well with people, we could also add a slightly simpler
version like "-enableAllNewFeatures" or whatever, which a user can use after an upgrade with
the understanding that it will prevent rollback.


I personally prefer Proposal 2. It helps a lot with the HA upgrade scenario described above, allows
rollback, and also has the nice property that it will allow us to selectively backport features
between software versions without bizarre non-linear version numbering hacks like we have
today.
                
> Allow edit log/fsimage format changes without changing layout version
> ---------------------------------------------------------------------
>
>                 Key: HDFS-5223
>                 URL: https://issues.apache.org/jira/browse/HDFS-5223
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.1.1-beta
>            Reporter: Aaron T. Myers
>
> Currently all HDFS on-disk formats are versioned by the single layout version. This means
that even for changes which might be backward compatible, like the addition of a new edit
log op code, we must go through the full `namenode -upgrade` process, which requires coordination
with DNs, etc. HDFS should support a lighter-weight alternative.

