hadoop-common-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel
Date Thu, 14 Sep 2017 03:54:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165693#comment-16165693 ]

ASF GitHub Bot commented on HADOOP-13600:
-----------------------------------------

Github user sahilTakiar commented on a diff in the pull request:

    https://github.com/apache/hadoop/pull/157#discussion_r138791721
  
    --- Diff: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java ---
    @@ -891,50 +902,123 @@ private boolean innerRename(Path source, Path dest)
           }
     
           List<DeleteObjectsRequest.KeyVersion> keysToDelete = new ArrayList<>();
    +      List<DeleteObjectsRequest.KeyVersion> dirKeysToDelete = new ArrayList<>();
           if (dstStatus != null && dstStatus.isEmptyDirectory() == Tristate.TRUE) {
             // delete unnecessary fake directory.
             keysToDelete.add(new DeleteObjectsRequest.KeyVersion(dstKey));
           }
     
    -      Path parentPath = keyToPath(srcKey);
    -      RemoteIterator<LocatedFileStatus> iterator = listFilesAndEmptyDirectories(
    -          parentPath, true);
    -      while (iterator.hasNext()) {
    -        LocatedFileStatus status = iterator.next();
    -        long length = status.getLen();
    -        String key = pathToKey(status.getPath());
    -        if (status.isDirectory() && !key.endsWith("/")) {
    -          key += "/";
    -        }
    -        keysToDelete
    -            .add(new DeleteObjectsRequest.KeyVersion(key));
    -        String newDstKey =
    -            dstKey + key.substring(srcKey.length());
    -        copyFile(key, newDstKey, length);
    -
    -        if (hasMetadataStore()) {
    -          // with a metadata store, the object entries need to be updated,
    -          // including, potentially, the ancestors
    -          Path childSrc = keyToQualifiedPath(key);
    -          Path childDst = keyToQualifiedPath(newDstKey);
    -          if (objectRepresentsDirectory(key, length)) {
    -            S3Guard.addMoveDir(metadataStore, srcPaths, dstMetas, childSrc,
    -                childDst, username);
    +      // A blocking queue that tracks all objects that need to be deleted
    +      BlockingQueue<Optional<DeleteObjectsRequest.KeyVersion>> deleteQueue = new ArrayBlockingQueue<>(
    +              (int) Math.round(MAX_ENTRIES_TO_DELETE * 1.5));
    +
    +      // Used to track if the delete thread was gracefully shutdown
    +      boolean deleteFutureComplete = false;
    +      FutureTask<Void> deleteFuture = null;
    +
    +      try {
    +        // Launch a thread that will read from the deleteQueue and batch delete any files that have already been copied
    +        deleteFuture = new FutureTask<>(() -> {
    +          while (true) {
    +            while (keysToDelete.size() < MAX_ENTRIES_TO_DELETE) {
    +              Optional<DeleteObjectsRequest.KeyVersion> key = deleteQueue.take();
    +
    +              // The thread runs until it is given an EOF message (an Optional#empty())
    +              if (key.isPresent()) {
    --- End diff --
    
    I removed the usage of `Optional`. I now use a `private static final DeleteObjectsRequest.KeyVersion END_OF_KEYS_TO_DELETE = new DeleteObjectsRequest.KeyVersion(null, null);` as the EOF marker instead.
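
For illustration only: a minimal, standalone sketch of the sentinel ("poison pill") shutdown pattern described in the comment above, with plain String keys standing in for DeleteObjectsRequest.KeyVersion and a removeKeys() stub in place of the real batched S3 delete. The class name, batch size, and stub are assumptions for the demo, not the code in the patch.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class SentinelDeleteQueueSketch {

      // A dedicated sentinel instance; compared by reference, so it can
      // never collide with a real key (hypothetical stand-in for the
      // KeyVersion(null, null) sentinel mentioned above).
      private static final String END_OF_KEYS_TO_DELETE = new String("EOF");

      private static final int MAX_ENTRIES_TO_DELETE = 3; // tiny batch size for the demo

      public static void main(String[] args) throws Exception {
        BlockingQueue<String> deleteQueue =
            new ArrayBlockingQueue<>((int) Math.round(MAX_ENTRIES_TO_DELETE * 1.5));
        ExecutorService executor = Executors.newSingleThreadExecutor();

        // Consumer: collect keys into batches and "delete" each full batch.
        Future<Void> deleteFuture = executor.submit(() -> {
          List<String> batch = new ArrayList<>(MAX_ENTRIES_TO_DELETE);
          while (true) {
            String key = deleteQueue.take();       // blocks until a key or the sentinel arrives
            if (key == END_OF_KEYS_TO_DELETE) {    // reference equality identifies the sentinel
              break;
            }
            batch.add(key);
            if (batch.size() >= MAX_ENTRIES_TO_DELETE) {
              removeKeys(batch);                   // stand-in for the real batched delete call
              batch.clear();
            }
          }
          if (!batch.isEmpty()) {
            removeKeys(batch);                     // flush the final partial batch
          }
          return null;
        });

        // Producer: enqueue each copied key, then the sentinel to stop the consumer.
        for (int i = 0; i < 10; i++) {
          deleteQueue.put("src/part-" + i);
        }
        deleteQueue.put(END_OF_KEYS_TO_DELETE);

        deleteFuture.get();                        // surfaces any failure from the delete thread
        executor.shutdown();
      }

      private static void removeKeys(List<String> keys) {
        System.out.println("batch delete of " + keys.size() + " keys: " + keys);
      }
    }

A reference-compared sentinel avoids wrapping every queue entry in an Optional while still giving the consumer an unambiguous end-of-stream signal.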


> S3a rename() to copy files in a directory in parallel
> -----------------------------------------------------
>
>                 Key: HADOOP-13600
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13600
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.7.3
>            Reporter: Steve Loughran
>            Assignee: Sahil Takiar
>         Attachments: HADOOP-13600.001.patch
>
>
> Currently a directory rename does a one-by-one copy, making the request O(files * data).
> If the copy operations were launched in parallel, the duration of the copy may be reducible
> to the duration of the longest copy. For a directory with many files, this will be significant.
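
For illustration only: a rough, self-contained sketch of the parallel-copy idea in the description above, using a fixed thread pool and a copyFile() stub in place of the real S3 COPY request. The pool size, key names, and class name are assumptions for the demo, not the HADOOP-13600 patch itself.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelCopySketch {

      public static void main(String[] args) throws Exception {
        // Keys under the source "directory"; values are illustrative only.
        List<String> srcKeys = Arrays.asList("dir/part-0", "dir/part-1", "dir/part-2");
        ExecutorService pool = Executors.newFixedThreadPool(4);   // bounded copy parallelism

        // Submit one copy task per object instead of copying them one at a time.
        List<Future<?>> copies = new ArrayList<>();
        for (String srcKey : srcKeys) {
          String dstKey = "dest/" + srcKey.substring("dir/".length());
          copies.add(pool.submit(() -> copyFile(srcKey, dstKey)));
        }

        // Wait for every copy; wall-clock time now tends toward the slowest
        // single copy rather than the sum of all copies.
        for (Future<?> copy : copies) {
          copy.get();                                             // rethrows any copy failure
        }
        pool.shutdown();
      }

      private static void copyFile(String srcKey, String dstKey) {
        System.out.println("COPY " + srcKey + " -> " + dstKey);   // stand-in for the S3 COPY request
      }
    }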



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
