hadoop-hdfs-issues mailing list archives

From "Eli Collins (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (HDFS-2053) NameNode detects "Inconsistent diskspace" for directories with quota-enabled subdirectories (introduced by HDFS-1377)
Date Thu, 09 Jun 2011 20:06:59 GMT

     [ https://issues.apache.org/jira/browse/HDFS-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Eli Collins reassigned HDFS-2053:

    Assignee: Michael Noll

Hey Michael - thank you for the excellent report!

In summary, the condition used to warn in FSDirectory#computeContentSummary has a bug: it
compares the cached value for the directory not to a computed value for that directory but
to a computed value that includes the directory and its siblings.

The bug results in a spurious warning; it doesn't impact, e.g., the correctness of quotas. Given
this, I think either of two things is reasonable:
# Remove the warning (which removes the bug)
# Compute the correct summary for just that directory (your patch)

The latter sounds good to me. Allocating a four-element long array for each level in the directory
hierarchy isn't bad, and this method isn't on a hot path.
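
To make the difference concrete, here is a minimal, self-contained sketch of the two traversal shapes. It is a toy model, not the HDFS code: the Node class, sharedSummary, subtreeSummary and the sizes are all made up for illustration, and only slot 3 of the array (disk space, as in the quoted snippet) is modelled.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the traversal; Node and both walk methods are hypothetical, not HDFS classes. */
public class QuotaSummarySketch {

    static class Node {
        final String name;
        final long size;        // bytes held directly by this node (stands in for its files)
        final long dsQuota;     // -1 means no space quota is set
        final long cachedSpace; // cached subtree usage; only meaningful when dsQuota != -1
        final List<Node> children = new ArrayList<>();

        Node(String name, long size, long dsQuota, long cachedSpace) {
            this.name = name;
            this.size = size;
            this.dsQuota = dsQuota;
            this.cachedSpace = cachedSpace;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Buggy shape: a single accumulator is shared by the whole walk. */
    static long[] sharedSummary(Node node, long[] summary, List<String> warnings) {
        summary[3] += node.size;
        for (Node child : node.children) {
            sharedSummary(child, summary, warnings);
        }
        // summary[3] may already contain space from siblings visited earlier.
        if (node.dsQuota != -1 && node.cachedSpace != summary[3]) {
            warnings.add(node.name);
        }
        return summary;
    }

    /** Patched shape: a fresh array per directory, folded into the caller's array at the end. */
    static long[] subtreeSummary(Node node, long[] summary, List<String> warnings) {
        long[] subtree = new long[]{0, 0, 0, 0};
        subtree[3] += node.size;
        for (Node child : node.children) {
            subtreeSummary(child, subtree, warnings);
        }
        // subtree[3] covers this node's subtree only, so the comparison is meaningful.
        if (node.dsQuota != -1 && node.cachedSpace != subtree[3]) {
            warnings.add(node.name);
        }
        for (int i = 0; i < summary.length; i++) {
            summary[i] += subtree[i];
        }
        return summary;
    }

    /** Mirrors the layout in the report: A and B without quota, C with a 1g space quota. */
    static Node sampleTree() {
        return new Node("hdfs-1377", 0, -1, 0)
            .add(new Node("A", 100, -1, 0))
            .add(new Node("B", 200, -1, 0))
            .add(new Node("C", 300, 1L << 30, 300));
    }

    public static void main(String[] args) {
        List<String> shared = new ArrayList<>();
        sharedSummary(sampleTree(), new long[4], shared);
        System.out.println("shared accumulator warns for: " + shared);   // [C]

        List<String> perSubtree = new ArrayList<>();
        subtreeSummary(sampleTree(), new long[4], perSubtree);
        System.out.println("per-subtree array warns for:  " + perSubtree); // []
    }
}
```

In the shared version, C is compared against 600 (A + B + C) although its cached value of 300 is correct. The per-subtree version compares against 300 and still reports a total of 600 to the caller.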

Nit: I'd change the array allocation to the following, since we assume summary has length 4 and
it should be faster.

assert 4 == summary.length;
long[] subtreeSummary = new long[]{0,0,0,0};

Wrt testing, how about adding the following right after space is calculated:

assert -1 == node.getDsQuota() || space == subtreeSummary[3];

Asserts are enabled by default when the tests are run. If TestQuota doesn't trigger this assert,
then add a test similar to what you did manually that will trigger it.
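
As a rough illustration of that invariant, the sketch below expresses it as a plain predicate and shows the usual idiom for checking whether asserts are actually enabled in the running JVM (they are off unless the JVM is started with -ea, which test harnesses commonly do). The class and method names are hypothetical; the two byte counts are the cached and computed values from the warning quoted in the report.

```java
/** Hypothetical helper; not part of HDFS. */
public class SubtreeInvariantSketch {

    /** True when assertions are enabled for this class (e.g. JVM started with -ea). */
    static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true; // the assignment only executes when asserts are on
        return enabled;
    }

    /** Same condition as the suggested assert: no quota set, or cached equals computed. */
    static boolean consistent(long dsQuota, long cachedSpace, long computedSpace) {
        return -1 == dsQuota || cachedSpace == computedSpace;
    }

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + assertionsEnabled());

        long quota = 1L << 30; // 1g, as set on directory C in the report
        // Values from the reported warning: the comparison fails against the shared summary.
        System.out.println(consistent(quota, 433872320L, 438465355L));
        // A directory without a space quota is never checked.
        System.out.println(consistent(-1, 433872320L, 438465355L));
    }
}
```

If asserts turn out to be disabled in a given test run, the assert in the namenode code is silently skipped, so a test that relies on it should first confirm that they are on.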

Also, please generate a patch against trunk (HDFS-2053_v2.txt doesn't apply for me).


> NameNode detects "Inconsistent diskspace" for directories with quota-enabled subdirectories
(introduced by HDFS-1377)
> ---------------------------------------------------------------------------------------------------------------------
>                 Key: HDFS-2053
>                 URL: https://issues.apache.org/jira/browse/HDFS-2053
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.3,,
>         Environment: Hadoop release with the HDFS-1377 patch applied.
> My impression is that the same issue exists also in the other branches where the HDFS-1377
patch has been applied to (see description).
>            Reporter: Michael Noll
>            Assignee: Michael Noll
>            Priority: Minor
>             Fix For: 0.20.3,,
>         Attachments: HDFS-2053_v1.txt, HDFS-2053_v2.txt
> *How to reproduce*
> {code}
> # create test directories
> $ hadoop fs -mkdir /hdfs-1377/A
> $ hadoop fs -mkdir /hdfs-1377/B
> $ hadoop fs -mkdir /hdfs-1377/C
> # ...add some test data (few kB or MB) to all three dirs...
> # set space quota for subdir C only
> $ hadoop dfsadmin -setSpaceQuota 1g /hdfs-1377/C
> # the following two commands _on the parent dir_ trigger the warning
> $ hadoop fs -dus /hdfs-1377
> $ hadoop fs -count -q /hdfs-1377
> {code}
> Warning message in the namenode logs:
> {code}
> 2011-06-09 09:42:39,817 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: Inconsistent
diskspace for directory C. Cached: 433872320 Computed: 438465355
> {code}
> Note that the commands are run on the _parent directory_ but the warning is shown for
the _subdirectory_ with space quota.
> *Background*
> The bug was introduced by the HDFS-1377 patch, which is currently committed to at least
branch-0.20, branch-0.20-security, branch-0.20-security-204, branch-0.20-security-205 and
release-0.20.3-rc2.  In the patch, {{src/hdfs/org/apache/hadoop/hdfs/server/namenode/INodeDirectory.java}}
was updated to trigger the warning above if the cached and computed diskspace values are not
the same for a directory with quota.
> The warning is written by {{computeContentSummary(long[] summary)}} in {{INodeDirectory}}.
In the method an inode's children are recursively walked through while the {{summary}} parameter
is passed and updated along the way.
> {code}
>   /** {@inheritDoc} */
>   long[] computeContentSummary(long[] summary) {
>     if (children != null) {
>       for (INode child : children) {
>         child.computeContentSummary(summary);
>       }
>     }
> {code}
> The condition that triggers the warning message compares the current node's cached diskspace
(via {{node.diskspaceConsumed()}}) with the corresponding field in {{summary}}.
> {code}
>       if (-1 != node.getDsQuota() && space != summary[3]) {
>         NameNode.LOG.warn("Inconsistent diskspace for directory "
>           +getLocalName()+". Cached: "+space+" Computed: "+summary[3]);
> {code}
> However {{summary}} may already include diskspace information from other inodes at this
point (i.e. from different subtrees than the subtree of the node for which the warning message
is shown; in our example for the tree at {{/hdfs-1377}}, {{summary}} can already contain information
from {{/hdfs-1377/A}} and {{/hdfs-1377/B}} when it is passed to inode {{/hdfs-1377/C}}). 
Hence the cached value for {{C}} can incorrectly be different from the computed value.
> *How to fix*
> The supplied patch creates a fresh summary array for the subtree of the current node.
 The walk through the children passes and updates this {{subtreeSummary}} array, and the condition
is checked against {{subtreeSummary}} instead of the original {{summary}}.  The original {{summary}}
is updated with the values of {{subtreeSummary}} before it returns.
> *Unit Tests*
> I have run "ant test" on my patched build without any errors*.  However the existing
unit tests did not catch this issue for the original HDFS-1377 patch, so this might not mean
anything. ;-)
> That said I am unsure what the most appropriate way to unit test this issue would be.
 A straightforward approach would be to automate the steps in the _How to reproduce_ section
above and check whether the NN logs an incorrect warning message.  But I'm not sure how this
check could be implemented.  Feel free to provide some pointers if you have some ideas.
> *Note about Fix Version/s*
> The patch _should_ apply to all branches where the HDFS-1377 patch has been committed.
 In my environment, the build was Hadoop release with a (trivial) backport of HDFS-1377
( release does not ship with the HDFS-1377 fix).  I could apply the patch successfully
to {{branch-0.20-security}}, {{branch-0.20-security-204}} and {{release-0.20.3-rc2}}, for
instance.  Since I'm a bit confused regarding the upcoming 0.20.x release versions (0.20.x
vs. 0.20.20x.y) I have been so bold and added to the list of affected versions
even though it is actually only affected when HDFS-1377 is applied to it...
> Best,
> Michael
> *Well, I get one error for {{TestRumenJobTraces}} but first this seems to be completely
unrelated and second I get the same test error when running the tests on the stock
release build.

