hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3510) Fix FSEditLog pre-allocation
Date Mon, 18 Jun 2012 21:22:42 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396267#comment-13396267 ]

Colin Patrick McCabe commented on HDFS-3510:
--------------------------------------------

Yep, this JIRA is still valid.  The problem is that we don't always pre-allocate enough space
on the disk.  There is a check that looks like this:

bq. if (position + 4096 >= fc.size()) { ... do preallocation ...

As you can see, this will not work correctly if the next batch of writes to the edit log is
greater than 4096 bytes.
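
To make the failure concrete, here is a minimal sketch that models only the arithmetic of the check quoted above. It is not the actual FSEditLog code: the class name, the constants, and both helper methods are hypothetical, and the 1 MB chunk size is taken from the description of this JIRA.

```java
class PreallocCheckDemo {
    static final int MARGIN = 4096;                 // threshold used by the quoted check
    static final int PREALLOC_CHUNK = 1024 * 1024;  // 1 MB chunk, per the JIRA description

    /** Models the quoted check: returns the file size after it has run once. */
    static long sizeAfterCheck(long position, long fileSize) {
        if (position + MARGIN >= fileSize) {
            fileSize += PREALLOC_CHUNK;  // "do preallocation"
        }
        return fileSize;
    }

    /** True if a batch of batchLen bytes written at position stays inside reserved space. */
    static boolean batchFits(long position, long fileSize, long batchLen) {
        return position + batchLen <= sizeAfterCheck(position, fileSize);
    }

    public static void main(String[] args) {
        long fileSize = PREALLOC_CHUNK;
        long position = fileSize - 5000;  // 5000 bytes of padding remain, so the check is skipped
        System.out.println("4000-byte batch fits: " + batchFits(position, fileSize, 4000));
        System.out.println("8000-byte batch fits: " + batchFits(position, fileSize, 8000));
    }
}
```

In this model, a writer with 5000 bytes of padding left skips preallocation, so an 8000-byte batch runs past the reserved region and lands on disk space that was never set aside for it.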

For this reason, we continue to get bug reports and mailing list posts about how running out
of disk space leads to a corrupt edit log.
For an example of a public one, check out https://groups.google.com/a/cloudera.org/group/scm-users/browse_thread/thread/3ec955a120daf241?hl=en#
(This mailing list post pertains to CDH3 / branch-1, but the code in question has the same
problem.)

If you don't have time to look at the code, you can also check out the ASCII art in the description
of this JIRA.  As always, thanks for taking the time to look at this.
                
> Fix FSEditLog pre-allocation
> ----------------------------
>
>                 Key: HDFS-3510
>                 URL: https://issues.apache.org/jira/browse/HDFS-3510
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 2.0.0-alpha
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>             Fix For: 1.0.0, 2.0.1-alpha
>
>         Attachments: HDFS-3510-b1.001.patch, HDFS-3510-b1.002.patch, HDFS-3510.001.patch,
HDFS-3510.003.patch, HDFS-3510.004.patch, HDFS-3510.004.patch, HDFS-3510.006.patch, HDFS-3510.007.patch,
HDFS-3510.008.patch
>
>
> In the FSEditLog, we want to avoid running out of space in the middle of writing an edit
log operation to the disk. We do this by a process called "preallocation"-- reserving space
on the disk for the upcoming edit log entries before beginning to write them.
> The idea is that if we're going to encounter an out-of-disk-space condition, we don't
want it to happen in the middle of writing valid data.  Instead, we want it to happen in the
middle of writing padding bytes.  The edit log uses bytes with the value 0xff (in decimal,
-1) as padding.  These bytes correspond to FSEditLogOp.OP_INVALID.
> The current preallocation strategy is flawed.  Although we preallocate a very large chunk
at a time-- 1 megabyte, in fact-- we only do this preallocation when we are within 4096
bytes of the end of the file.  This means that the effective preallocation length is
only 4096 bytes.  A batch of edit log entries could easily be more than this.  There is evidence
that this has caused problems in the field for end-users.
> Here is a visual illustration of the old preallocation strategy:
> {code}
> first write
> |
> V <----- 1 MB ----->
> +--+---------------+
> |__|FFFFFFFFFFFFFFF|
> +--+---------------+
>     second write
>     |
>     V
> +--+------+--------+
> |__|______|FFFFFFFF|
> +--+------+--------+
>            third write
>            |
>            V
> +--+------+------+-+
> |__|______|______|_|
> +--+------+------+-+
>                   fourth write
>                   | (NOT preallocated)
>                   V
> +--+------+------+-+
> |__|______|______|________
> +--+------+------+-+
>                           fifth write
>                           |
>                           V<--- 1 MB -->
> +--+------+------+--------+---+--------+
> |__|______|______|________|___|FFFFFFFF|
> +--+------+------+--------+---+--------+
> {code}
> And here is the new preallocation strategy:
> {code}
> first write
> |
> V <----- 1 MB ----->
> +--+---------------+
> |__|FFFFFFFFFFFFFFF|
> +--+---------------+
>     second write
>     |
>     V
> +--+------+--------+
> |__|______|FFFFFFFF|
> +--+------+--------+
>            third write
>            |
>            V
> +--+------+------+-+
> |__|______|______|_|
> +--+------+------+-+
>                   fourth write
>                   |
>                   V <------ 1MB-->
> +--+------+------+--------+------+
> |__|______|______|________|FFFFFF|
> +--+------+------+--------+------+
>                           fifth write
>                           |
>                           V
> +--+------+------+--------+---+--+
> |__|______|______|________|___|FF|
> +--+------+------+--------+---+--+
> {code}
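
As a rough illustration of the new strategy, the sketch below extends the file with 0xff padding, one 1 MB chunk at a time, until the pending batch fits, and only then writes the batch over the padding. It is an assumption-laden model rather than the attached patch: the class and method names are hypothetical, and it uses plain java.nio calls instead of the real edit log stream classes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

class PreallocSketch {
    static final int CHUNK = 1024 * 1024;  // 1 MB preallocation unit, per the description
    static final byte PAD = (byte) 0xff;   // padding byte; -1 as a signed Java byte

    /** Smallest size, grown in whole chunks, that holds batchLen bytes starting at position. */
    static long targetSize(long position, int batchLen, long size) {
        while (position + batchLen > size) {
            size += CHUNK;
        }
        return size;
    }

    /** Hypothetical helper: pads the file out to targetSize before any real data is written. */
    static void preallocateFor(FileChannel fc, long position, int batchLen) throws IOException {
        long size = fc.size();
        long target = targetSize(position, batchLen, size);
        byte[] fill = new byte[CHUNK];
        Arrays.fill(fill, PAD);
        for (long off = size; off < target; off += CHUNK) {
            ByteBuffer buf = ByteBuffer.wrap(fill);
            long at = off;
            while (buf.hasRemaining()) {
                at += fc.write(buf, at);  // an out-of-space error here only truncates padding
            }
        }
    }

    /** Reserves space first, then overwrites the padding with the batch; returns the new position. */
    static long writeBatch(FileChannel fc, long position, byte[] batch) throws IOException {
        preallocateFor(fc, position, batch.length);
        ByteBuffer buf = ByteBuffer.wrap(batch);
        long at = position;
        while (buf.hasRemaining()) {
            at += fc.write(buf, at);
        }
        return at;
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("edits", ".log");
        try (FileChannel fc = FileChannel.open(p, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long pos = writeBatch(fc, 0, new byte[100]);
            pos = writeBatch(fc, pos, new byte[CHUNK]);  // batch larger than the remaining padding
            System.out.println("position=" + pos + " size=" + fc.size());
        } finally {
            Files.deleteIfExists(p);
        }
    }
}
```

The point of sizing the reservation from the batch length, rather than from a fixed 4096-byte margin, is that any out-of-disk-space error surfaces while padding is being written and not in the middle of a valid edit log operation.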
