hadoop-hdfs-issues mailing list archives

From "Sean Mackrory (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10797) Disk usage summary of snapshots causes renamed blocks to get counted twice
Date Thu, 25 Aug 2016 23:35:21 GMT

    https://issues.apache.org/jira/browse/HDFS-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15438128#comment-15438128

Sean Mackrory commented on HDFS-10797:

To reproduce the discrepancy, use the procedure below. Put a 100 MB file into HDFS and
snapshot it (hadoop fs -du -s reports 100 MB * replication factor after both operations),
then append another 100 MB to it (hadoop fs -du -s now reports 200 MB * replication factor).
If the file is then moved to trash or simply renamed, hadoop fs -du -s starts reporting
300 MB * replication factor in the second column. I believe at this point some of the blocks
shared between the snapshot and the regular file are counted twice: the summary treats the
move the same as a delete, but since the file was not actually deleted, its blocks are
counted again under the new name.
dd if=/dev/zero of=100MB.zero bs=10000 count=10000
bin/hadoop fs -mkdir -p /user/sean
bin/hadoop fs -chown sean /user/sean
bin/hadoop fs -put 100MB.zero /user/sean/HDFS-10797

bin/hdfs dfsadmin -allowSnapshot /user/sean
bin/hdfs dfs -createSnapshot /user/sean s1

bin/hadoop fs -appendToFile 100MB.zero /user/sean/HDFS-10797

bin/hadoop fs -du -s /user/sean

bin/hadoop fs -rm /user/sean/HDFS-10797 # or simply rename with hadoop fs -mv
bin/hadoop fs -du -s /user/sean

> Disk usage summary of snapshots causes renamed blocks to get counted twice
> --------------------------------------------------------------------------
>                 Key: HDFS-10797
>                 URL: https://issues.apache.org/jira/browse/HDFS-10797
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Sean Mackrory
> DirectoryWithSnapshotFeature.computeContentSummary4Snapshot calculates how much disk
> usage is attributable to a snapshot by tallying up the files in the snapshot that have
> since been deleted (so that it does not overlap with regular files, whose disk usage is
> computed separately). However, that list is derived from a diff in which a moved (to
> Trash or otherwise) or renamed file appears as a deletion plus a creation, and the two
> entries may share blocks. Only the deletion is taken into account, so those shared
> blocks are represented twice in the disk usage tally.
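The arithmetic behind the double count can be sketched with a simplified model (this is
an illustration only, not the actual DirectoryWithSnapshotFeature code; the class and
variable names below are hypothetical, and the replication factor is omitted for
simplicity):

```java
// Simplified model of the snapshot disk-usage tally described above.
// A rename appears in the snapshot diff as a delete + create, so the
// renamed file's pre-snapshot blocks are charged to the snapshot (via
// the deleted-file list) AND to the live namespace.
public class SnapshotDoubleCount {
    static final long MB = 1024L * 1024L;

    public static void main(String[] args) {
        long snapshotLen = 100 * MB; // file length when snapshot s1 was taken
        long currentLen  = 200 * MB; // length after the 100 MB append

        // The diff lists the old name as deleted, so the tally charges
        // the full snapshot copy to the snapshot...
        long snapshotUsage = snapshotLen;
        // ...and separately charges the live (renamed) file, even though
        // its first 100 MB are the very same blocks.
        long liveUsage = currentLen;

        long reported = snapshotUsage + liveUsage; // 300 MB, as observed
        long actual   = currentLen;                // only 200 MB of unique blocks

        System.out.println("reported: " + reported / MB + " MB");
        System.out.println("actual:   " + actual / MB + " MB");
        System.out.println("counted twice: " + (reported - actual) / MB + " MB");
    }
}
```

Counting only truly deleted blocks on the snapshot side (i.e. excluding blocks still
referenced by the renamed live file) would make the two tallies disjoint.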

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
