hadoop-hdfs-dev mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-7443) Datanode upgrade to BLOCKID_BASED_LAYOUT sometimes fails
Date Tue, 25 Nov 2014 14:18:13 GMT
Kihwal Lee created HDFS-7443:

             Summary: Datanode upgrade to BLOCKID_BASED_LAYOUT sometimes fails
                 Key: HDFS-7443
                 URL: https://issues.apache.org/jira/browse/HDFS-7443
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.6.0
            Reporter: Kihwal Lee
            Priority: Blocker

When we upgraded a medium-sized cluster from 2.5 to 2.6, about 4% of the datanodes did not
come up.  They tried the data file layout upgrade for BLOCKID_BASED_LAYOUT introduced in
HDFS-6482, but failed.

All failures were caused by {{NativeIO.link()}} throwing an IOException reporting {{EEXIST}}.  The
datanodes didn't die right away; the upgrade was retried each time the block pool initialization
was retried, which happened whenever {{BPServiceActor}} was registering with the namenode.  After many retries,
the datanodes terminated.  This left {{previous.tmp}} and {{current}} with no {{VERSION}}
file in the block pool slice storage directory.
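For illustration only, here is a minimal sketch of how a hard-link call reports {{EEXIST}} when its destination already exists. It uses the JDK's {{java.nio.file.Files.createLink}} rather than Hadoop's {{NativeIO.link()}}, and the class name, method name, and file names are hypothetical. It shows the OS-level behavior only and is not a claim about where the duplicate destination came from in this incident.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LinkEexistSketch {
    /**
     * Tries to hard-link 'link' to 'existing'.
     * Returns true if the link was created, false if the destination
     * was already present (the EEXIST case at the OS level).
     */
    static boolean tryLink(Path existing, Path link) throws IOException {
        try {
            // Note the argument order: new link first, existing file second.
            Files.createLink(link, existing);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("link-sketch");
        Path block = Files.createFile(dir.resolve("blk_1001")); // hypothetical block file
        Path dest = dir.resolve("blk_1001.linked");

        System.out.println("first attempt created link:  " + tryLink(block, dest));
        // A second attempt against the same destination fails, as a retry
        // over an already populated target directory would.
        System.out.println("second attempt created link: " + tryLink(block, dest));
    }
}
```

The point of the sketch is that linking is not idempotent: any retry that runs against a destination tree left over from an earlier attempt will hit the same error.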

Although {{previous.tmp}} contained the old {{VERSION}} file, its content was in the new layout
and the subdirs were all newly created ones.  This shouldn't have happened, because the upgrade-recovery
logic in {{Storage}} removes {{current}} and renames {{previous.tmp}} to {{current}} before
retrying.  All successfully upgraded volumes had the old state preserved in their {{previous}} directory.
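The recovery step described above can be sketched roughly as follows. This is not the code in {{Storage}}; it is a simplified stand-in built on {{java.nio.file}}, with hypothetical class and method names, and it assumes only the directory names mentioned in this report ({{current}}, {{previous.tmp}}, {{VERSION}}).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class UpgradeRecoverySketch {
    /** Deletes a directory tree, children before parents. No-op if absent. */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts deeper paths ahead of their parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    /**
     * If an interrupted upgrade left previous.tmp behind, discard the
     * half-built current and restore previous.tmp as current.
     * Returns true if a rollback of the partial upgrade was performed.
     */
    static boolean recoverInterruptedUpgrade(Path storageDir) throws IOException {
        Path current = storageDir.resolve("current");
        Path previousTmp = storageDir.resolve("previous.tmp");
        if (!Files.isDirectory(previousTmp)) {
            return false; // nothing to recover
        }
        deleteRecursively(current);
        Files.move(previousTmp, current);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path storage = Files.createTempDirectory("bp-slice");
        // Simulate an interrupted upgrade: old state in previous.tmp,
        // partially built new layout in current.
        Files.createDirectories(storage.resolve("previous.tmp"));
        Files.createFile(storage.resolve("previous.tmp").resolve("VERSION"));
        Files.createDirectories(storage.resolve("current").resolve("subdir0"));

        System.out.println("recovered: " + recoverInterruptedUpgrade(storage));
        System.out.println("current/VERSION restored: "
                + Files.exists(storage.resolve("current").resolve("VERSION")));
    }
}
```

If recovery ran this way before every retry, {{previous.tmp}} should only ever hold the original layout, which is why the half-upgraded content found there is unexpected.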

In summary, two issues were observed.
- The upgrade failed because {{link()}} failed with {{EEXIST}}.
- {{previous.tmp}} contained a half-upgraded layout rather than the content of the original {{current}}.

This message was sent by Atlassian JIRA
