hadoop-common-issues mailing list archives

From "Sahil Takiar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel
Date Thu, 10 Nov 2016 20:06:59 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654993#comment-15654993 ]

Sahil Takiar commented on HADOOP-13600:

[~stevel@apache.org] I created a Pull Request: https://github.com/apache/hadoop/pull/157

Let me know what you think of my approach. I verified that the S3 unit tests pass, but
have not run the integration tests yet.

The patch is pretty simple, but it's different from the approach you outlined in HIVE-15093.
Below are some notes:

* A new method called {{copyFileAsync}} was added, which returns a {{Copy}} object; the original
method {{copyFile}} is still there, but it now just invokes {{copyFileAsync(...).waitForCopyResult()}}
* Deletes are done inside the {{ProgressListener}}; I removed the logic in {{rename(...)}}
that issues bulk delete requests
** I'm assuming the {{ProgressListener}} is invoked by the same thread that is issuing the
copy request (correct me if I am wrong)
** The drawback is that more calls to S3 are made since delete ops aren't grouped together,
but the advantage is that deletes are now done across multiple threads
*** Let me know if you think this scales. Another benefit of my approach is that the logic
is much simpler. If we need bulk delete ops then some type of intermediate blocking queue
may be necessary
* I'm not entirely sure how to make the listing sequential; the API seems to suggest you have
to sequentially call {{listNextBatchOfObjects(...)}}
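
For illustration only, the copy-then-delete flow described in the notes above can be sketched roughly as follows. This is not the code from the pull request: an in-memory map stands in for the S3 bucket, {{CompletableFuture}} stands in for the SDK's {{Copy}} handle and {{ProgressListener}} callback, and the names {{ParallelRenameSketch}}, {{store}} and {{renameDirectory}} are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative only: an in-memory map plays the role of the S3 bucket. */
class ParallelRenameSketch {
    // Hypothetical stand-in for the object store: key -> object bytes.
    final Map<String, byte[]> store = new ConcurrentHashMap<>();
    private final ExecutorService pool;

    ParallelRenameSketch(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    /** Starts a copy and returns immediately (the role of copyFileAsync). */
    CompletableFuture<Void> copyFileAsync(String srcKey, String dstKey) {
        return CompletableFuture.runAsync(() -> {
            byte[] data = store.get(srcKey);
            if (data == null) {
                throw new IllegalStateException("missing source: " + srcKey);
            }
            store.put(dstKey, data.clone());
        }, pool);
    }

    /** Blocking variant kept for existing callers; it only waits on the async one. */
    void copyFile(String srcKey, String dstKey) {
        copyFileAsync(srcKey, dstKey).join();
    }

    /** Copies every key under srcPrefix to dstPrefix in parallel, then waits for all. */
    void renameDirectory(String srcPrefix, String dstPrefix) {
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        // Snapshot the keys first so the loop does not observe its own writes.
        for (String srcKey : new ArrayList<>(store.keySet())) {
            if (!srcKey.startsWith(srcPrefix)) {
                continue;
            }
            String dstKey = dstPrefix + srcKey.substring(srcPrefix.length());
            // One delete per file, issued when that file's copy completes.
            // thenRun executes on the thread that finished the copy, or on
            // the caller if the copy had already finished.
            pending.add(copyFileAsync(srcKey, dstKey)
                    .thenRun(() -> store.remove(srcKey)));
        }
        // Block until every copy and its delete are done; a failed copy
        // surfaces here as a CompletionException.
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

A caller would create the sketch with a pool size, populate {{store}}, call {{renameDirectory("src/", "dst/")}} and then {{shutdown()}}. Because each delete is chained to its own copy, there is no bulk delete request; grouping the deletes would need something like the intermediate queue mentioned above.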

> S3a rename() to copy files in a directory in parallel
> -----------------------------------------------------
>                 Key: HADOOP-13600
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13600
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.7.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
> Currently a directory rename does a one-by-one copy, making the request O(files * data).
> If the copy operations were launched in parallel, the duration of the copy may be reducible
> to the duration of the longest copy. For a directory with many files, this will be significant.

This message was sent by Atlassian JIRA

