hadoop-common-issues mailing list archives

From "Aaron Fabbri (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories
Date Fri, 09 Mar 2018 02:37:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392299#comment-16392299
] 

Aaron Fabbri commented on HADOOP-15209:
---------------------------------------

Noticed you just mentioned cancelling the patch, so never mind my last "is it ready" comment.

My first feedback is about CopyCommitter#deleteMissing().  The goal seems to be to reduce
no-op deletes, but you have 3 retries with 1 second sleeps on failed deletes.  Ideally we'd
only do that for S3, or add a config flag (default false) to enable retries there.  Really
we should be able to query the FS for its capabilities and retry only on eventually consistent stores.
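
A minimal sketch of the flag-gated retry suggested above, with no Hadoop dependencies. The class
name, method name and parameters are hypothetical and are not taken from the patch; in DistCp the
flag would presumably be read from the job configuration, with a default of false.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper: retry a failed delete only when the caller has enabled retries. */
public class RetryingDelete {

    /**
     * Runs the delete once, or up to maxAttempts times when retryEnabled is true.
     *
     * @param delete       the delete operation; returns true on success
     * @param retryEnabled assumed to come from a config flag (default false)
     * @param maxAttempts  total attempts when retries are enabled
     * @param sleepMillis  pause between attempts
     * @return true if any attempt succeeded
     */
    static boolean deleteWithOptionalRetry(BooleanSupplier delete,
                                           boolean retryEnabled,
                                           int maxAttempts,
                                           long sleepMillis) {
        // With retries disabled, a failed delete costs exactly one call and no sleep.
        int attempts = retryEnabled ? Math.max(1, maxAttempts) : 1;
        for (int i = 1; i <= attempts; i++) {
            if (delete.getAsBoolean()) {
                return true;
            }
            if (i < attempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up rather than keep retrying.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

With the flag off, a store such as HDFS never pays for the sleeps; only a caller that opts in
(for example, for S3) gets the retry loop.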

Just ping me when you think this is ready to commit and I'll re-review.

> DistCp to eliminate needless deletion of files under already-deleted directories
> --------------------------------------------------------------------------------
>
>                 Key: HADOOP-15209
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15209
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, HADOOP-15209-003.patch,
HADOOP-15209-004.patch, HADOOP-15209-005.patch, HADOOP-15209-006.patch
>
>
> DistCp issues a delete(file) request even if the file is underneath an already deleted directory.
This generates needless load on filesystems/object stores, and, if the store throttles delete,
can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, then it will
know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not overload
the heap of the process.
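
One way the "history of deleted directories" described above could be kept within a fixed heap
budget is a size-bounded LRU map of directory paths. This is only a sketch under that assumption:
the class and method names are hypothetical, paths are plain strings rather than Hadoop Path
objects, and the patch itself may use a different structure.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded history of deleted directories, keyed by absolute path. */
public class DeletedDirHistory {

    private final Map<String, Boolean> deletedDirs;

    DeletedDirHistory(final int maxEntries) {
        // Access-ordered LinkedHashMap that evicts the least recently used entry,
        // so the history never holds more than maxEntries paths.
        this.deletedDirs = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Remember that a directory has been deleted (recursively). */
    void recordDeletedDir(String dir) {
        deletedDirs.put(dir, Boolean.TRUE);
    }

    /** Returns true if any ancestor directory of path is in the history. */
    boolean isUnderDeletedDir(String path) {
        int slash = path.lastIndexOf('/');
        while (slash > 0) {
            String ancestor = path.substring(0, slash);
            if (deletedDirs.get(ancestor) != null) {
                return true; // the delete for this path would be a no-op
            }
            slash = ancestor.lastIndexOf('/');
        }
        return false;
    }
}
```

Evicting an entry is safe in this design: the worst case is that a redundant delete is issued
for a path whose parent has been forgotten, which is the behaviour the issue starts from.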



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

