hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories
Date Fri, 09 Mar 2018 11:21:01 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392729#comment-16392729 ]

Steve Loughran commented on HADOOP-15209:
-----------------------------------------

I cancelled because I want to simplify that retry logic. By removing it :). It's too complex,
and I don't see what it delivers.

I've looked through all the delete() calls and apart from ftp oddness, all filesystems only
return false on delete() if the dir wasn't there.

So the algorithm I want is
! delete() => log.info() & continue

But also: have the -i option (ignoreErrors) work for this phase too; any exception from delete
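
To make the proposal concrete, here is a minimal sketch of that delete phase, not the actual patch. The `Deleter` interface and `DeletePhase` class are hypothetical stand-ins rather than Hadoop or DistCp APIs, the list of strings stands in for log.info(), and it assumes that with -i an exception from delete is logged and skipped instead of failing the job.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a filesystem delete call; not a Hadoop API. */
interface Deleter {
  /** Returns false if nothing was deleted, e.g. the path was already gone. */
  boolean delete(String path) throws IOException;
}

/** Sketch of the delete phase: no retries; a false result is logged and skipped. */
final class DeletePhase {
  final List<String> log = new ArrayList<>(); // stands in for log.info()
  int deleted = 0;
  int missing = 0;
  int failed = 0;

  void run(Deleter fs, List<String> targets, boolean ignoreErrors) throws IOException {
    for (String path : targets) {
      try {
        if (fs.delete(path)) {
          deleted++;
        } else {
          // !delete() => log.info() & continue
          missing++;
          log.add("INFO: delete returned false, continuing: " + path);
        }
      } catch (IOException e) {
        // Assumption: -i (ignoreErrors) downgrades the failure to a log entry.
        if (!ignoreErrors) {
          throw e;
        }
        failed++;
        log.add("WARN: ignoring failure to delete " + path + ": " + e.getMessage());
      }
    }
  }
}
```

With `ignoreErrors` set to false the first exception propagates and ends the phase; with it set to true the loop visits every target and only the counters and log record what went wrong.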

I'm also not seeing that stack trace above when I turn on inconsistent s3a listings with the
old code. It's complaining about duplicate entries, which could be something up with our simulated
listings, or something else.

> DistCp to eliminate needless deletion of files under already-deleted directories
> --------------------------------------------------------------------------------
>
>                 Key: HADOOP-15209
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15209
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, HADOOP-15209-003.patch,
HADOOP-15209-004.patch, HADOOP-15209-005.patch, HADOOP-15209-006.patch
>
>
> DistCp issues a delete(file) request even if the file is underneath an already-deleted directory.
This generates needless load on filesystems/object stores, and, if the store throttles delete,
can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, then it will
know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not overload
the heap of the process.
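
One way to picture the idea in the description above is a bounded history of deleted directories. This is a sketch under stated assumptions, not the structure used in the attached patches: `DeletedDirHistory` and its methods are hypothetical names, paths are plain '/'-separated strings, and the heap bound comes from a simple LRU map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded record of deleted directories; names are illustrative only. */
final class DeletedDirHistory {
  private final Map<String, Boolean> dirs;

  DeletedDirHistory(final int maxEntries) {
    // Access-ordered map that evicts the least recently used entry, so heap
    // use stays bounded however many directories are deleted.
    this.dirs = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
        return size() > maxEntries;
      }
    };
  }

  /** Remember that a directory has been deleted. */
  void recordDeletedDir(String dir) {
    dirs.put(normalize(dir), Boolean.TRUE);
  }

  /** True if any ancestor of the path is a remembered deleted directory. */
  boolean isUnderDeletedDir(String path) {
    String p = normalize(path);
    int slash = p.lastIndexOf('/');
    while (slash > 0) {
      p = p.substring(0, slash); // step up to the parent
      if (dirs.get(p) != null) {
        return true;
      }
      slash = p.lastIndexOf('/');
    }
    return false;
  }

  /** Drop a single trailing slash so "/a/b/" and "/a/b" compare equal. */
  private static String normalize(String path) {
    return path.length() > 1 && path.endsWith("/")
        ? path.substring(0, path.length() - 1)
        : path;
  }
}
```

A caller would check `isUnderDeletedDir(path)` before issuing a delete and call `recordDeletedDir(path)` after deleting a directory. Evicting an entry is safe in this sketch: the only cost is a redundant delete for a path whose parent has been forgotten.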



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

