hadoop-hdfs-issues mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (HDFS-1221) NameNode unable to start due to stale edits log after a crash
Date Thu, 05 Aug 2010 22:06:20 GMT

     [ https://issues.apache.org/jira/browse/HDFS-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko resolved HDFS-1221.

    Resolution: Not A Problem

> NameNode unable to start due to stale edits log after a crash
> -------------------------------------------------------------
>                 Key: HDFS-1221
>                 URL: https://issues.apache.org/jira/browse/HDFS-1221
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.1
>            Reporter: Thanh Do
> - Summary: 
> If a crash happens during FSEditLog.createEditLogFile(), the
> edits log file on disk may be left in a stale state. On the next restart,
> the NameNode throws an exception while parsing the edits file because of
> the stale data, and startup fails.
> Note: This is just one example. Since the edits log (and fsimage)
> have no checksum, they are vulnerable to corruption as well.
> - Details:
> The steps to create a new edits log (as we infer from the HDFS code) are:
> 1) truncate the file to zero size
> 2) write FSConstants.LAYOUT_VERSION to the buffer
> 3) append the end-of-file marker OP_INVALID to the buffer
> 4) preallocate 1MB of data, filled with 0
> 5) flush the buffer to disk
> Note that only steps 1, 4, and 5 actually change the data on disk.
> Now, suppose a crash happens after step 4 but before step 5.
> On the next restart, the NameNode loads this edits log file (which contains
> all zeros). The first thing parsed is the LAYOUT_VERSION, which is 0. This
> is OK, because the NameNode has code to handle that case
> (although we would expect LAYOUT_VERSION to be -18).
> It then parses the operation code, which is also 0. Unfortunately, since 0
> is the value of OP_ADD, the NameNode expects the parameters of that
> operation. It calls readString to read the path, which throws an exception,
> and the restart fails.
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (thanhdo@cs.wisc.edu) and 
> Haryadi Gunawi (haryadi@eecs.berkeley.edu)
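
The scenario in the report above can be illustrated with a small, self-contained Java sketch. It is not the HDFS code: the class and method names (StaleEditsSketch, classify, healthyFile, crashedFile) are hypothetical, and the value -1 for OP_INVALID is an assumption. The values 0 for OP_ADD and -18 for the layout version come from the report. The sketch only shows how a simple reader would interpret the first five bytes of an all-zero file versus a correctly flushed one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class StaleEditsSketch {
    static final int EXPECTED_LAYOUT_VERSION = -18; // value quoted in the report
    static final byte OP_ADD = 0;                   // value quoted in the report
    static final byte OP_INVALID = -1;              // assumed end-of-file marker value

    /** Describes how a simple reader would interpret the start of an edits file. */
    static String classify(byte[] contents) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(contents))) {
            int version = in.readInt();   // first 4 bytes: layout version
            byte opcode = in.readByte();  // next byte: first operation code
            if (opcode == OP_INVALID) {
                return "version=" + version + ", empty log";
            }
            if (opcode == OP_ADD) {
                // A real reader would now try to read the OP_ADD parameters.
                return "version=" + version + ", OP_ADD record expected";
            }
            return "version=" + version + ", opcode=" + opcode;
        }
    }

    /** File contents after all five steps complete: version, then the end marker. */
    static byte[] healthyFile() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(EXPECTED_LAYOUT_VERSION);
        out.writeByte(OP_INVALID);
        out.flush();
        return bytes.toByteArray();
    }

    /** File contents after a crash between steps 4 and 5: 1MB of zeros. */
    static byte[] crashedFile() {
        return new byte[1024 * 1024]; // Java arrays are zero-initialized
    }

    public static void main(String[] args) throws IOException {
        System.out.println("healthy: " + classify(healthyFile()));
        System.out.println("crashed: " + classify(crashedFile()));
    }
}
```

Under these assumptions, the healthy file is classified as an empty log, while the zero-filled file is read as layout version 0 followed by an OP_ADD record whose parameters are not there.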

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
