hadoop-hdfs-issues mailing list archives

From "Sean Mackrory (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10797) Disk usage summary of snapshots causes renamed blocks to get counted twice
Date Wed, 28 Sep 2016 23:22:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15531199#comment-15531199 ]

Sean Mackrory commented on HDFS-10797:

Thanks for pointing that out, [~jingzhao]. I added test cases that cover some inter-directory
renames. Of course, some of them fail and still report the wrong usage. I'd really
like to come up with semantics that are both consistent and unsurprising to a user.
I improved the situation somewhat by computing which nodes were deleted (as opposed to renamed)
in the context of all the diffs for a directory instead of each diff individually. That is
a step in the right direction, but the real fix would be to have some global context when computing
usage that ensures each INode in the hierarchy is counted exactly once. It looks to me like
that will require some refactoring: although the counts are cumulative, they can
accumulate in multiple distinct objects before being combined. We would need to refactor some
functions so that all counts are added directly to a single object, and that same object
could then prevent a node from being counted twice: once because it was removed from a snapshotted
directory, and again because of where it resides now.
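
To make the single-object idea concrete, here is a minimal sketch. It assumes every INode
can be identified by a unique numeric id; the class and method names are hypothetical and
are not taken from the HDFS code base or from any of the attached patches.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical accumulator (illustration only, not HDFS code).
 * All usage is added to this one object, which remembers which
 * inode ids it has already counted.
 */
final class DedupUsageCounter {
    // Ids of inodes that have already contributed to the total.
    private final Set<Long> seenInodeIds = new HashSet<>();
    private long totalBytes = 0;

    /**
     * Adds the inode's size unless that inode was counted before.
     * Returns true if the size was added, false if it was skipped.
     */
    boolean add(long inodeId, long bytes) {
        if (!seenInodeIds.add(inodeId)) {
            return false; // already counted via another path
        }
        totalBytes += bytes;
        return true;
    }

    long totalBytes() {
        return totalBytes;
    }
}
```

With something along these lines, a renamed file reached once through the snapshot's
deleted list and again at its current location would contribute only on the first call
to add(), regardless of which traversal gets there first.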

Thoughts on this approach before I go further?

> Disk usage summary of snapshots causes renamed blocks to get counted twice
> --------------------------------------------------------------------------
>                 Key: HDFS-10797
>                 URL: https://issues.apache.org/jira/browse/HDFS-10797
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>         Attachments: HDFS-10797.001.patch, HDFS-10797.002.patch, HDFS-10797.003.patch
> DirectoryWithSnapshotFeature.computeContentSummary4Snapshot calculates how much disk
space is used by a snapshot by tallying up the files in the snapshot that have since been
deleted (that way it won't overlap with regular files, whose disk usage is computed separately).
However, that is determined from a diff that shows moved (to Trash or otherwise) or renamed
files as a deletion and a creation operation, which may overlap with the list of blocks. Only
the deletion operation is taken into consideration, and this causes those blocks to be counted
twice in the disk usage tally.
