hadoop-common-issues mailing list archives

From "Sean Mackrory (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13760) S3Guard: add delete tracking
Date Wed, 26 Apr 2017 17:51:04 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985252#comment-15985252 ]

Sean Mackrory commented on HADOOP-13760:
----------------------------------------

I did a deep dive into what happens when renaming a directory that contains a mix of nested
directories, files, empty directories, etc. Key things I learned:

* listFilesAndDirectories should really be named listFilesAndEmptyDirectories: the iterator
does not return separate items for the non-empty directories. [~fabbri] suggested offline
that we at least add a test for this behavior, to prevent it from being "fixed" in the future,
and that we rename the method too. I don't see a need to implement a
list-all-filesystem-vertices function right now. This isn't a problem for the method's current
uses: it was added so that S3GuardTool didn't miss empty directories when importing, and the
rest of the import process takes care of the non-empty directories. In rename it happens to
behave pretty much the same as the request it's replacing (although it filters out tombstones,
empty directories don't end with a '/', etc.), and that appears to be perfectly correct.

* The only increase in any metrics I could find is that listFilesAndDirectories performs
a couple more list and object metadata requests than we were doing before, if and only if
S3Guard is disabled. We could avoid that by having separate code in innerRename to filter
out tombstones, but my previous concerns about that approach still apply. I have some
workloads running now to see how much the extra requests affect real performance. I will
post details when I have them.
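
To make the first point concrete, here is a minimal, self-contained sketch of the "files and empty directories only" listing behavior. It does not use the real S3AFileSystem or S3Guard code: the class name ListingSketch, the path-to-isDirectory map, and the method signature are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ListingSketch {

    /**
     * Hypothetical model of the behavior described above: returns every file
     * under root, plus only those directories that have nothing beneath them.
     * Non-empty directories are implied by their children and are not emitted.
     *
     * @param tree maps an absolute path to true (directory) or false (file)
     */
    public static List<String> listFilesAndEmptyDirectories(
            Map<String, Boolean> tree, String root) {
        List<String> result = new ArrayList<>();
        String prefix = root.endsWith("/") ? root : root + "/";
        for (Map.Entry<String, Boolean> entry : tree.entrySet()) {
            String path = entry.getKey();
            if (!path.startsWith(prefix)) {
                continue; // outside the subtree being listed
            }
            if (!entry.getValue()) {
                result.add(path); // files are always returned
                continue;
            }
            // A directory is returned only if no other entry lives under it.
            String childPrefix = path + "/";
            boolean hasChild = false;
            for (String other : tree.keySet()) {
                if (other.startsWith(childPrefix)) {
                    hasChild = true;
                    break;
                }
            }
            if (!hasChild) {
                result.add(path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Boolean> tree = new TreeMap<>();
        tree.put("/src/a", true);        // non-empty directory: not listed
        tree.put("/src/a/file1", false);
        tree.put("/src/empty", true);    // empty directory: listed
        tree.put("/src/file2", false);
        System.out.println(listFilesAndEmptyDirectories(tree, "/src"));
    }
}
```

A caller that needs every directory, such as a rename that must recreate the tree, has to derive the non-empty ones from the parents of the returned paths. A regression test of the kind suggested above could assert that "/src/a" is absent from the result.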

To add to the functional testing, I ran a bunch of Hive-on-MR and Hive-on-Spark workloads
and everything still worked correctly.

> S3Guard: add delete tracking
> ----------------------------
>
>                 Key: HADOOP-13760
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13760
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Aaron Fabbri
>            Assignee: Sean Mackrory
>         Attachments: HADOOP-13760-HADOOP-13345.001.patch, HADOOP-13760-HADOOP-13345.002.patch
>
>
> Following the S3AFileSystem integration patch in HADOOP-13651, we need to add delete
> tracking.
> Current behavior on delete is to remove the metadata from the MetadataStore.  To make
> deletes consistent, we need to add an {{isDeleted}} flag to {{PathMetadata}} and check it when
> returning results from functions like {{getFileStatus()}} and {{listStatus()}}.  In HADOOP-13651,
> I added TODO comments in most of the places these new conditions are needed.  The work does
> not look too bad.
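
As a rough illustration of the delete tracking described in the issue, the sketch below keeps a tombstone entry on delete instead of removing the metadata, and hides tombstoned paths from lookups and listings. The TombstoneSketch class and its nested PathMetadata are simplified, hypothetical stand-ins, not the real S3Guard MetadataStore or PathMetadata classes; only the {{isDeleted}} flag and the method names getFileStatus and listStatus come from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TombstoneSketch {

    /** Simplified stand-in for a metadata entry carrying a delete marker. */
    public static final class PathMetadata {
        public final String path;
        public final boolean isDeleted;

        PathMetadata(String path, boolean isDeleted) {
            this.path = path;
            this.isDeleted = isDeleted;
        }
    }

    // In-memory stand-in for a metadata store, keyed by absolute path.
    private final Map<String, PathMetadata> store = new TreeMap<>();

    public void put(String path) {
        store.put(path, new PathMetadata(path, false));
    }

    /** Records a tombstone rather than removing the entry. */
    public void delete(String path) {
        store.put(path, new PathMetadata(path, true));
    }

    /** Returns null when the path is unknown or has been tombstoned. */
    public PathMetadata getFileStatus(String path) {
        PathMetadata meta = store.get(path);
        return (meta == null || meta.isDeleted) ? null : meta;
    }

    /** Lists the direct children of dir, skipping tombstoned entries. */
    public List<String> listStatus(String dir) {
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        List<String> children = new ArrayList<>();
        for (PathMetadata meta : store.values()) {
            if (meta.isDeleted || !meta.path.startsWith(prefix)) {
                continue;
            }
            String remainder = meta.path.substring(prefix.length());
            if (!remainder.isEmpty() && remainder.indexOf('/') < 0) {
                children.add(meta.path);
            }
        }
        return children;
    }

    public static void main(String[] args) {
        TombstoneSketch sketch = new TombstoneSketch();
        sketch.put("/d/a");
        sketch.put("/d/b");
        sketch.delete("/d/a");
        System.out.println(sketch.listStatus("/d"));
    }
}
```

Keeping the tombstone is what lets the store answer "deleted" for a path that an eventually consistent object listing might still return.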



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

