hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15208) DistCp to offer option to save src/dest filesets as alternative to delete()
Date Wed, 07 Feb 2018 02:45:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354886#comment-16354886 ]

Steve Loughran commented on HADOOP-15208:

Patch 001

Adds a new option "xtrack <path>" (x to indicate experimental), which saves the source and
destination listings, as sequence files of (Text, CopyListingFileStatus), to the given path.

* bit of refactoring to ensure this action is the same as in -delete
* warnings in {{CopyListingFileStatus}} that this is utterly unstable and may break
without warning.
* {{AbstractContractDistCpTest}} tests for update and this new track option
* Added tests for local FS and HDFS in the distcp test module; this helps get the tests working
first, then the code. For that I had to copy the test resources local.xml and hdfs.xml from the
relevant test resource trees (the alternative, putting them in the production module, is a more
significant change)
* clean up TestCopyCommitter, as it was not in a good state (assertEquals arguments the wrong
way round, all exception details lost by catching IOE and calling fail() without the inner cause,
etc.). I have made no changes to the tests other than the cleanup. With the cleaned-up
code, if a test fails, it'll be clearer why.
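
For anyone unsure what the TestCopyCommitter cleanup refers to, here is a minimal sketch of the two problems using only the JDK. It is not code from the patch: {{commitJob()}}, {{mismatchMessage()}} and the class name are hypothetical stand-ins, and {{mismatchMessage()}} only imitates the style of a JUnit assertEquals(expected, actual) failure message.

```java
import java.io.IOException;

class TestCleanupSketch {

  /** Hypothetical operation under test; it always fails so we can see how the failure surfaces. */
  static void commitJob() throws IOException {
    throw new IOException("simulated commit failure");
  }

  /** Anti-pattern: the IOException is swallowed, so its message and stack trace are lost. */
  static AssertionError failWithoutCause() {
    try {
      commitJob();
      return null;
    } catch (IOException e) {
      return new AssertionError("Commit failed");
    }
  }

  /** Cleaned-up form: the inner exception travels along as the cause. */
  static AssertionError failWithCause() {
    try {
      commitJob();
      return null;
    } catch (IOException e) {
      return new AssertionError("Commit failed: " + e, e);
    }
  }

  /** Stand-in for the message an assertEquals(expected, actual) style check builds. */
  static String mismatchMessage(Object expected, Object actual) {
    return "expected:<" + expected + "> but was:<" + actual + ">";
  }

  public static void main(String[] args) {
    int expectedFiles = 3;  // what the test wants
    int actualFiles = 2;    // what the code under test produced

    // Arguments the wrong way round: the report blames the wrong value.
    System.out.println("swapped: " + mismatchMessage(actualFiles, expectedFiles));
    // Arguments in (expected, actual) order: the report matches reality.
    System.out.println("correct: " + mismatchMessage(expectedFiles, actualFiles));

    System.out.println("cause lost: " + failWithoutCause().getCause());
    System.out.println("cause kept: " + failWithCause().getCause());
  }
}
```

In a real JUnit test the simplest fix is usually to declare the test method as throwing Exception and let the IOException propagate, so the framework reports the full stack trace itself.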

* Docs: should we add them?
* Are we happy with saving the low-level unstable structs? I am, provided everyone who sees
the warnings understands that this is not a persistence format, just something for
version-specific tracing and tools.
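
As an illustration of the kind of experiment the saved listings are meant to enable, here is a rough sketch of one delete strategy a tool might try: work out which destination paths are missing from the source, and skip anything underneath a directory that is already going to be deleted. Everything in it is an assumption made for the example. Plain path strings stand in for the (Text, CopyListingFileStatus) records (I am assuming the Text key is the path relative to the copy root), the class and method names are hypothetical, and it does not read sequence files or touch any Hadoop API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class ListingDiffSketch {

  /** True if any parent directory of the path has already been chosen for deletion. */
  static boolean hasDeletedAncestor(String path, Set<String> deleted) {
    int slash = path.lastIndexOf('/');
    while (slash > 0) {
      path = path.substring(0, slash);
      if (deleted.contains(path)) {
        return true;
      }
      slash = path.lastIndexOf('/');
    }
    return false;
  }

  /**
   * Returns the destination paths that are absent from the source, omitting
   * entries already covered by the deletion of one of their parent directories.
   */
  static List<String> pathsToDelete(Set<String> source, Set<String> dest) {
    Set<String> deleted = new LinkedHashSet<>();
    // Sorted iteration guarantees a directory is visited before its children.
    for (String path : new TreeSet<>(dest)) {
      if (source.contains(path)) {
        continue;  // still present in the source: keep it
      }
      if (hasDeletedAncestor(path, deleted)) {
        continue;  // a parent delete will already remove it
      }
      deleted.add(path);
    }
    return new ArrayList<>(deleted);
  }

  public static void main(String[] args) {
    Set<String> source = Set.of("/data", "/data/a.txt");
    Set<String> dest = Set.of("/data", "/data/a.txt", "/data/old",
        "/data/old/x", "/data/old/y", "/data/old-2", "/tmp.bin");
    System.out.println(pathsToDelete(source, dest));
  }
}
```

With the sample data in main(), four of the seven destination entries would be removed, but only three delete requests are issued, because the two children of /data/old are covered by deleting the directory. Whether that kind of saving matters on real datasets is exactly what the saved listings should help establish.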

I need this for a few reasons:
* Get more traces of those real-world distcp operations which overload stores like S3
* Line up for HADOOP-15209
* Have the option for an experimental bulk delete hidden in hadoop-aws (HADOOP-15191) if really,
really needed. But I don't think it is needed, not if we can eliminate most of the delete()

> DistCp to offer option to save src/dest filesets as alternative to delete()
> ---------------------------------------------------------------------------
>                 Key: HADOOP-15208
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15208
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: tools/distcp
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: HADOOP-15208-001.patch
> There are opportunities to improve distcp delete performance and scalability with object
stores, but you need to test with production datasets to determine if the optimizations work,
don't run out of memory, etc.
> By adding the option to save the sequence files of source, dest listings, people (myself
included) can experiment with different strategies before trying to commit one which doesn't

This message was sent by Atlassian JIRA
