hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HADOOP-15191) Add Private/Unstable BulkDelete operations to supporting object stores for DistCP
Date Tue, 30 Jan 2018 02:01:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344385#comment-16344385 ]

Steve Loughran edited comment on HADOOP-15191 at 1/30/18 2:00 AM:
------------------------------------------------------------------

h2. Proposed

* New interface {{org.apache.hadoop.fs.store.BulkIO}}
* S3A to implement this, relaying to {{S3ABulkOperations}}
* {{S3ABulkOperations}} to implement an optimised delete

If you look at the cost of delete(file), it's not just the DELETE call; it's:

# getFileStatus(file) : HEAD, [HEAD], [LIST].
# DELETE
# getFileStatus(file.parent): HEAD, HEAD, LIST.
# if not found, PUT file.parent + "/"


FWIW, we could maybe optimise away that second getFileStatus on the assumption that there's no
file or dir marker there; all you need to do is check for the LIST call returning 1+ entries.

Anyway, you are looking at ~7 HTTP requests per delete.
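The tally behind that estimate can be sketched as follows; this is a back-of-the-envelope model, not real Hadoop code, and the class and method names are purely illustrative. The bracketed calls in step 1 are optional probes, which is why the typical count sits a little under the worst case.

{code}
// Rough per-request tally for a single delete(file), worst case:
public class DeleteCost {
    public static int worstCaseRequests() {
        int getFileStatusFile   = 3; // HEAD, [HEAD for dir marker], [LIST]
        int deleteCall          = 1; // DELETE
        int getFileStatusParent = 3; // HEAD, HEAD, LIST
        int recreateParent      = 1; // PUT file.parent + "/" if now empty
        return getFileStatusFile + deleteCall + getFileStatusParent + recreateParent;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseRequests()); // 8 in the worst case; ~7 typical
    }
}
{code}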

Optimising that directory creation is equally important. Now, we could just have the bulk
IO operation say "outcome of empty directories is undefined". I'm happy with that, but it's
more of a change to the observable outcome of a distcp call.

New {{S3ABulkOperations.bulkDeleteFiles}}

* No check for a file existing before delete
* Issues a bulk delete with the configured page size
* Builds up a tree of parent paths, and only attempts to create fake directories for the
parent directories at the bottom of the tree.

That is, if you delete the paths
{code}
/A/B.txt
/A/C/D.txt
/A/C/E.txt
{code}

Then the only directory to consider creating is /A/C/; after that you know the parent
path /A will have an entry, so it doesn't need
any work. The number of fake directory creations therefore goes from O(files) to O(leaves in
the directory tree): at best Ω(1), at worst O(files).

One caveat: we now create an empty dir even if the source file doesn't exist.
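That leaf-parent calculation can be sketched roughly like this; a simplified model using plain string paths rather than Hadoop's {{Path}}, with {{leafParents}} as a hypothetical helper name, not the actual {{S3ABulkOperations}} code:

{code}
import java.util.*;

// Given the paths of all files just bulk-deleted, compute the minimal set of
// parent directories for which a fake directory marker may need creating.
// Only the "leaves" of the parent tree matter: creating /A/C/ guarantees
// that /A is non-empty, so /A needs no marker of its own.
public class LeafParents {
    static String parent(String path) {
        int i = path.lastIndexOf('/');
        return i <= 0 ? "/" : path.substring(0, i);
    }

    public static Set<String> leafParents(List<String> deletedFiles) {
        // 1. Collect the direct parent of every deleted file.
        Set<String> parents = new TreeSet<>();
        for (String f : deletedFiles) {
            parents.add(parent(f));
        }
        // 2. Drop any parent that is an ancestor of another parent:
        //    a marker in the deeper directory already makes it non-empty.
        Set<String> leaves = new HashSet<>();
        for (String p : parents) {
            boolean isAncestor = false;
            for (String q : parents) {
                if (!q.equals(p) && q.startsWith(p.equals("/") ? "/" : p + "/")) {
                    isAncestor = true;
                    break;
                }
            }
            if (!isAncestor) {
                leaves.add(p);
            }
        }
        return leaves;
    }

    public static void main(String[] args) {
        // The example above: only /A/C survives; /A is its ancestor.
        System.out.println(leafParents(List.of("/A/B.txt", "/A/C/D.txt", "/A/C/E.txt")));
    }
}
{code}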


h2. Testing

I've made the page size configurable (fs.s3a.experimental.bulkdelete.pagesize). We can switch
on the paged delete mode with a very small page size, and so check it works properly even
for a small number of files.
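The paging logic being exercised is roughly this; again an illustrative sketch with made-up names, not the S3A implementation. A page size of 2 turns five keys into three delete requests, so even tiny test datasets hit the multi-page path:

{code}
import java.util.*;

// Sketch of paged bulk deletion: split the key list into pages of at most
// pageSize entries, one bulk-delete request per page.
public class PagedDelete {
    public static List<List<String>> pages(List<String> keys, int pageSize) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("page size must be >= 1");
        }
        List<List<String>> pages = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += pageSize) {
            pages.add(keys.subList(i, Math.min(i + pageSize, keys.size())));
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "c", "d", "e");
        // Page size of 2 -> three requests: [a, b], [c, d], [e]
        System.out.println(pages(keys, 2));
    }
}
{code}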

New unit test suite {{TestS3ABulkOperations}}, primarily checks tree logic for the directory
creation process.

New integration test suite {{ITestS3ABulkOperations}} performs bulk IO and sees what it does.

The existing {{AbstractContractDistCpTest}} test extends its {{deepDirectoryStructureToRemote}}
test to become {{deepDirectoryStructureToRemoteWithSync}}, 
doing an update with some files added, some removed, and assertions about the final state.
This verifies that distcp is happy. I've also reviewed the logs
to see that all is well there.

h2. Alternate Design: publish summary and do it independently

The other tactic for doing this would be to not integrate DistCP with the bulk delete, and
instead have it publish the files of input & output for a followup reconciler.

Good: 

* No changes to DistCP delete process
* No need to add any explicit API/interface in hadoop-common

Bad:

* New visible option to distcp to save output
* May lead to expectations of future maintenance of the option, and of a persistent format
for the data

You'd still need to add the bulk delete calls alongside the S3A FS, and to any other stores to
which the bulk IO was added (Wasb could save on directory setup, by the look of things,
as would oss: and swift:).





> Add Private/Unstable BulkDelete operations to supporting object stores for DistCP
> ---------------------------------------------------------------------------------
>
>                 Key: HADOOP-15191
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15191
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3, tools/distcp
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: HADOOP-15191-001.patch, HADOOP-15191-002.patch
>
>
> Large scale DistCP with the -delete option doesn't finish in a viable time because of
the final CopyCommitter doing a 1-by-1 delete of all missing files. This isn't randomized
(the list is sorted), and it's throttled by AWS.
> If bulk deletion of files was exposed as an API, distCP would do 1/1000 of the REST calls,
so would not get throttled.
> Proposed: add an initially private/unstable interface for stores, {{BulkDelete}} which
declares a page size and offers a {{bulkDelete(List<Path>)}} operation for the bulk
deletion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

