hadoop-common-dev mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2073) Datanode corruption if machine dies while writing VERSION file
Date Sat, 20 Oct 2007 05:43:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536402 ]

Hadoop QA commented on HADOOP-2073:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12368060/versionFileSize1.patch
against trunk revision r586264.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests +1.  The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/973/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/973/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/973/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/973/console

This message is automatically generated.

> Datanode corruption if machine dies while writing VERSION file
> --------------------------------------------------------------
>
>                 Key: HADOOP-2073
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2073
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.14.0
>            Reporter: Michael Bieniosek
>            Assignee: Konstantin Shvachko
>             Fix For: 0.15.0
>
>         Attachments: versionFileSize.patch, versionFileSize1.patch
>
>
> Yesterday, due to a bad mapreduce job, some of my machines went on OOM killing sprees
> and killed a bunch of datanodes, among other processes.  Since my monitoring software kept
> trying to bring up the datanodes, only to have the kernel kill them off again, each machine's
> datanode was probably killed many times.  A large percentage of these datanodes will not come
> up now, and write this message to the logs:
> 2007-10-18 00:23:28,076 ERROR org.apache.hadoop.dfs.DataNode: org.apache.hadoop.dfs.InconsistentFSStateException:
> Directory /hadoop/dfs/data is in an inconsistent state: file VERSION is invalid.
> When I check, /hadoop/dfs/data/current/VERSION is an empty file.  Consequently, I have
> to delete all the blocks on the datanode and start over.  Since the OOM killing sprees happened
> simultaneously on several datanodes in my DFS cluster, this could have crippled my dfs cluster.
> I checked the hadoop code, and in org.apache.hadoop.dfs.Storage, I see this:
> {{{
>     /**
>      * Write version file.
>      * 
>      * @throws IOException
>      */
>     void write() throws IOException {
>       corruptPreUpgradeStorage(root);
>       write(getVersionFile());
>     }
>     void write(File to) throws IOException {
>       Properties props = new Properties();
>       setFields(props, this);
>       RandomAccessFile file = new RandomAccessFile(to, "rws");
>       FileOutputStream out = null;
>       try {
>         file.setLength(0);
>         file.seek(0);
>         out = new FileOutputStream(file.getFD());
>         props.store(out, null);
>       } finally {
>         if (out != null) {
>           out.close();
>         }
>         file.close();
>       }
>     }
> }}}
> So if the datanode dies after file.setLength(0), but before props.store(out, null) completes, the
> VERSION file is left empty, which is the corrupted state I see.  Maybe it would be better if this
> method wrote a temporary file VERSION.tmp, and then copied it to VERSION, then deleted VERSION.tmp?
> That way, if VERSION was detected to be corrupt, the datanode could look at VERSION.tmp to
> recover the data.
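
The temporary-file idea above can be sketched roughly as follows. This is an
illustration only, not the attached versionFileSize1.patch, whose contents are
not shown in this message. It swaps the suggested copy-then-delete step for a
rename, on the assumption that the platform's rename replaces the target
atomically. The class name AtomicVersionWriter and its write method are
hypothetical, and java.nio.file is a later Java API than the one available to
the code quoted above.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

// Hypothetical helper, not part of org.apache.hadoop.dfs.Storage.
final class AtomicVersionWriter {

  /**
   * Writes props to a sibling "<name>.tmp" file, forces it to disk, and only
   * then renames it over the target. The target is never truncated in place,
   * so a crash should leave either the old file or the complete new one.
   */
  static void write(File to, Properties props) throws IOException {
    File tmp = new File(to.getParentFile(), to.getName() + ".tmp");
    try (FileOutputStream out = new FileOutputStream(tmp)) {
      props.store(out, null);
      out.flush();
      out.getFD().sync(); // ask the OS to push the bytes to the device
    }
    // Assumption: the file system supports an atomic rename that replaces an
    // existing target. Where it does not, this call throws an IOException
    // instead of silently falling back to a non-atomic copy.
    Files.move(tmp.toPath(), to.toPath(),
        StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
  }
}
```

With this shape, a process killed before the rename leaves at most a stale
VERSION.tmp next to an intact VERSION, which the next successful write
overwrites.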

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

