hadoop-hdfs-issues mailing list archives

From "Sean Mackrory (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10797) Disk usage summary of snapshots causes renamed blocks to get counted twice
Date Wed, 05 Oct 2016 16:59:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15549321#comment-15549321 ]

Sean Mackrory commented on HDFS-10797:
--------------------------------------

Thanks, [~xiaochen]. Except as noted below, I'll incorporate all your feedback into another
patch...

{quote}I don't think that will be a critical path to impact du performance{quote}

Yeah - I'm not sure whether anything performance-critical depends on du, but I would think
correctness of the final result is far more important here anyway.

{quote}In nodeIncluded, we safeguard includedNodes in a synchronized block, but we also provide
a getIncludedNodes method, which could potentially be updated by the caller. No real usage
yet, but I just feel this a bit unsafe in general, maybe return a clone of it instead?{quote}

My concern was not that the contents of the HashSet instance might change, but that the
reference 'counts' temporarily points to a different object entirely while the deleted,
snapshotted INodes are being tallied. The synchronized block is not there to protect the data
structure itself; it ensures that no one can call getCounts() while 'counts' points to the
wrong object. Beyond that, I think it's just as likely that threads calling getCounts() in
parallel will need their changes to propagate to the rest of the program, in which case the
correct solution would be a thread-safe data structure rather than a clone. So I do think
it's best to leave it as is until there is a use case for other concurrent accesses.
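
To make the point about the reference concrete, here is a minimal sketch of the pattern. It
is not the code from the patch: the class name SwapGuardSketch, the Map-based 'counts' field
and the add() and tallyIntoScratch() methods are all hypothetical stand-ins. It only shows a
synchronized getter keeping callers from seeing 'counts' while it is re-pointed at a scratch
object.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the class under review; not the HDFS code. */
final class SwapGuardSketch {
    // Reference that normally points at the main tally.
    private Map<String, Long> counts = new HashMap<>();

    /** Synchronized so no caller can observe 'counts' while it is swapped. */
    synchronized Map<String, Long> getCounts() {
        return counts;
    }

    /** Adds to whichever map 'counts' currently points at. */
    synchronized void add(String type, long value) {
        counts.merge(type, value, Long::sum);
    }

    /**
     * Runs a tally against a scratch map by temporarily re-pointing 'counts',
     * then restores the original reference, even if the tally throws.
     */
    synchronized Map<String, Long> tallyIntoScratch(Runnable tally) {
        Map<String, Long> original = counts;
        Map<String, Long> scratch = new HashMap<>();
        counts = scratch;
        try {
            // Java monitors are re-entrant, so add() may be called from here.
            tally.run();
        } finally {
            counts = original;
        }
        return scratch;
    }

    public static void main(String[] args) {
        SwapGuardSketch s = new SwapGuardSketch();
        s.add("LENGTH", 100L);
        Map<String, Long> snapshotOnly =
            s.tallyIntoScratch(() -> s.add("LENGTH", 40L));
        System.out.println("main tally:    " + s.getCounts());
        System.out.println("snapshot-only: " + snapshotOnly);
    }
}
```

Note that the lock only covers the swap. The map returned by getCounts() is still shared and
mutable, which is why a clone would not help callers that need their updates to be seen
elsewhere; a thread-safe structure would be the better fit if that use case appears.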

> Disk usage summary of snapshots causes renamed blocks to get counted twice
> --------------------------------------------------------------------------
>
>                 Key: HDFS-10797
>                 URL: https://issues.apache.org/jira/browse/HDFS-10797
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>         Attachments: HDFS-10797.001.patch, HDFS-10797.002.patch, HDFS-10797.003.patch,
>                      HDFS-10797.004.patch, HDFS-10797.005.patch, HDFS-10797.006.patch,
>                      HDFS-10797.007.patch, HDFS-10797.008.patch, HDFS-10797.009.patch
>
>
> DirectoryWithSnapshotFeature.computeContentSummary4Snapshot calculates how much disk
> usage is used by a snapshot by tallying up the files in the snapshot that have since been
> deleted (that way it won't overlap with regular files whose disk usage is computed separately).
> However that is determined from a diff that shows moved (to Trash or otherwise) or renamed
> files as a deletion and a creation operation that may overlap with the list of blocks. Only
> the deletion operation is taken into consideration, and this causes those blocks to get represented
> twice in the disk usage tallying.
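
The double counting described above can be illustrated with a small, self-contained sketch.
Everything in it is hypothetical (the DoubleCountSketch class, the FileEntry record, the
byte sizes); it is not the HDFS implementation or the patch. It assumes a renamed file keeps
the same inode id, so that remembering which ids have already been tallied is enough to
avoid counting it again. The review comment quoted earlier mentions an includedNodes set,
which suggests something along these lines, but that is an inference.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustration only; none of these names come from the HDFS source. */
final class DoubleCountSketch {
    /** Hypothetical file record: an inode id plus its size in bytes. */
    static final class FileEntry {
        final long inodeId;
        final long bytes;

        FileEntry(long inodeId, long bytes) {
            this.inodeId = inodeId;
            this.bytes = bytes;
        }
    }

    /** Adds live files and the diff's deleted list with no overlap check. */
    static long naiveUsage(List<FileEntry> live, List<FileEntry> deletedInDiff) {
        long total = 0;
        for (FileEntry f : live) {
            total += f.bytes;
        }
        for (FileEntry f : deletedInDiff) {
            total += f.bytes;
        }
        return total;
    }

    /** Counts each inode id at most once across both lists. */
    static long dedupedUsage(List<FileEntry> live, List<FileEntry> deletedInDiff) {
        Set<Long> included = new HashSet<>();
        long total = 0;
        for (FileEntry f : live) {
            if (included.add(f.inodeId)) {
                total += f.bytes;
            }
        }
        for (FileEntry f : deletedInDiff) {
            if (included.add(f.inodeId)) {
                total += f.bytes;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Inode 1 was renamed: still live, but the diff lists it as deleted.
        List<FileEntry> live = Arrays.asList(new FileEntry(1L, 128L));
        // Inode 2 really was deleted and survives only in the snapshot.
        List<FileEntry> deletedInDiff =
            Arrays.asList(new FileEntry(1L, 128L), new FileEntry(2L, 64L));

        System.out.println("naive   = " + naiveUsage(live, deletedInDiff));
        System.out.println("deduped = " + dedupedUsage(live, deletedInDiff));
    }
}
```

With these made-up sizes the naive tally counts the renamed file's 128 bytes twice, while
the deduplicated tally counts them once and still includes the 64 bytes that exist only in
the snapshot.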



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


