hadoop-common-issues mailing list archives

From "Ewan Higgs (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories
Date Tue, 13 Mar 2018 15:58:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397153#comment-16397153 ]

Ewan Higgs commented on HADOOP-15209:
-------------------------------------

I tried this out on a directory with 6995 files (a Hadoop binary distribution release), writing
to an S3A-compatible store, and it appears to work. The delete phase was fairly quick and
only logged deletes at the directory level, letting the FileSystem perform the delete for
everything under the directory's prefix.

Then I cleaned out the source directory and ran distcp again, and it correctly skipped all
of the deletions except the one for the top-level directory:
{quote}{{Deleted from target: files: 0 directories: 1; skipped deletions 6995; deletions already missing 0; failed deletes 0}}{quote}
As an aside, I cannot find where the DistCp counters are formatted such that
BANDWIDTH_IN_BYTES is displayed as "Bandwidth in Btyes":
{quote}
{{        DistCp Counters}}
{{                Bandwidth in Btyes=189349}}
{{                Bytes Copied=312048557}}
{{                Bytes Expected=312048557}}
{{                Files Copied=6155}}
{{                DIR_COPY=841}}
{quote}
 

> DistCp to eliminate needless deletion of files under already-deleted directories
> --------------------------------------------------------------------------------
>
>                 Key: HADOOP-15209
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15209
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, HADOOP-15209-003.patch,
>                      HADOOP-15209-004.patch, HADOOP-15209-005.patch, HADOOP-15209-006.patch, HADOOP-15209-007.patch
>
>
> DistCp issues a delete(file) request even if the file is underneath an already-deleted directory.
> This generates needless load on filesystems/object stores and, if the store throttles deletes,
> can dramatically slow down the delete operation.
> If the DistCp delete operation builds a history of deleted directories, it will
> know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not overload
> the heap of the process.
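
The idea in the description above (remember recently deleted directories, skip deletes beneath them, keep memory bounded) can be sketched roughly as follows. This is an illustrative sketch only, not the code from the attached patches: the class name RecentDeletedDirs, its methods, and the fixed-capacity FIFO eviction policy are all assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper (not taken from the HADOOP-15209 patches): remembers a
 * bounded number of recently deleted directories so that deletes of entries
 * underneath them can be skipped.
 */
final class RecentDeletedDirs {
  private final int capacity;                     // upper bound on remembered directories
  private final Deque<String> dirs = new ArrayDeque<>();
  private long skipped;                           // number of deletes elided so far

  RecentDeletedDirs(int capacity) {
    if (capacity < 1) {
      throw new IllegalArgumentException("capacity must be >= 1");
    }
    this.capacity = capacity;
  }

  /**
   * Returns true if a delete should be issued for this path, false if the
   * path lies under (or equals) a directory that is still remembered as deleted.
   */
  boolean shouldDelete(String path, boolean isDirectory) {
    // Compare with a trailing slash so "/dest/ab" is not treated as a child of "/dest/a".
    String key = path.endsWith("/") ? path : path + "/";
    for (String dir : dirs) {
      if (key.startsWith(dir)) {
        skipped++;
        return false;
      }
    }
    if (isDirectory) {
      // Evict the oldest entry first so heap use stays bounded by the capacity.
      if (dirs.size() == capacity) {
        dirs.removeFirst();
      }
      dirs.addLast(key);
    }
    return true;
  }

  long skippedCount() {
    return skipped;
  }
}
```

A caller would ask shouldDelete(path, isDirectory) for each entry missing from the source and only issue the recursive delete when it returns true. If a directory has been evicted, a later delete of one of its children is simply issued again, which is redundant but harmless. The linear scan assumes a small capacity and a listing order in which children arrive soon after their parent; both are assumptions of this sketch.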



