hadoop-hdfs-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-6152) distcp V2 doesn't preserve root dir's attributes when -p is specified
Date Tue, 25 Mar 2014 02:26:43 GMT

     [ https://issues.apache.org/jira/browse/HDFS-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yongjun Zhang updated HDFS-6152:

    Status: Patch Available  (was: Open)

The submitted patch tries to address the two issues reported.

Some notable changes:

1. A new boolean field "targetPathExists" is added to the DistCpOptions class. Its value
is derived by checking whether the target path exists at the beginning of distcp.
(Arguably, this information could be put somewhere else, but I found DistCpOptions
to be the most suitable place based on the current DistCp implementation.)

A new corresponding jobconf property CONF_LABEL_TARGET_PATH_EXISTS is introduced, and it's
initialized at the same time as the targetPathExists field.

The reason is that the result of SimpleCopyListing's computeSourceRootPath method depends
on DistCpOptions, e.g., on whether the distcp target exists and on whether the -update or
-overwrite switch is passed. Item 3 below also needs this info (via the new jobconf property).

Unit tests that use DistCpOptions need to set this field according to the test case's
setup.
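
To illustrate item 1, here is a minimal sketch of how such a field and jobconf property
could be initialized together. It is not the code from the patch: the Options class is a
hypothetical stand-in for DistCpOptions, a plain Map stands in for the job configuration,
java.nio.file replaces the Hadoop FileSystem API, and the key string behind
CONF_LABEL_TARGET_PATH_EXISTS is a placeholder, since its actual value is not given here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Simplified model of the change in item 1; these are NOT the real Hadoop classes.
final class TargetPathExistsSketch {

    // The constant name comes from the patch description; the key string is a
    // placeholder, since the message does not give the actual value.
    static final String CONF_LABEL_TARGET_PATH_EXISTS = "sketch.distcp.target.path.exists";

    // Hypothetical stand-in for DistCpOptions, reduced to the members used here.
    static final class Options {
        private final Path targetPath;
        private boolean targetPathExists;

        Options(Path targetPath) { this.targetPath = targetPath; }
        Path getTargetPath() { return targetPath; }
        boolean getTargetPathExists() { return targetPathExists; }
        void setTargetPathExists(boolean exists) { this.targetPathExists = exists; }
    }

    // Checks the target once, before any copying starts, and records the answer
    // both on the options object and in the (simulated) job configuration.
    static void recordTargetPathExists(Options options, Map<String, String> jobConf) {
        boolean exists = Files.exists(options.getTargetPath());
        options.setTargetPathExists(exists);
        jobConf.put(CONF_LABEL_TARGET_PATH_EXISTS, Boolean.toString(exists));
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        Options options = new Options(Paths.get(System.getProperty("java.io.tmpdir")));
        recordTargetPathExists(options, jobConf);
        System.out.println(options.getTargetPathExists());
        System.out.println(jobConf.get(CONF_LABEL_TARGET_PATH_EXISTS));
    }
}
```

Setting both values in one place keeps them from disagreeing later; a unit test that
builds its own options object would call the setter directly instead.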

2. For the issues reported in this JIRA, the entry that was previously skipped by the
writeToFileListing method with the following code:
 if (fileStatus.getPath().equals(sourcePathRoot) && fileStatus.isDirectory())
    return; // Skip the root-paths.
is now added to the filelisting when no -update/-overwrite is specified.

This entry is recognized by both the CopyMapper and CopyCommitter.
Using this entry, the CopyMapper will create the directory accordingly (for ISSUE 2), and the
CopyCommitter will update its attributes when -p is specified (for ISSUE 1).

E.g., for "distcp a/b xyz", where a/b is the source dir:
a. if xyz doesn't exist, then "a/b" is written to the copyListing with an empty relative path
b. if xyz exists, then "a/b" is written to the copyListing with relative path "b".
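
The two cases above can be sketched as follows. This is only a model of the single
source directory case under stated assumptions, not the SimpleCopyListing code: the
helper names sourceRoot and rootEntryRelativePath are hypothetical, java.nio.file.Path
stands in for the Hadoop Path class, and the source dir is assumed to have a parent
component (as a/b does).

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Simplified model of the single-source-directory case; NOT the real SimpleCopyListing code.
final class RootEntrySketch {

    // Hypothetical helper: picks the root that listing entries are made relative to.
    // If the target exists, the source dir lands under it (xyz/b), so the root is the
    // source's parent; otherwise the source dir maps onto the target itself.
    static Path sourceRoot(Path sourceDir, boolean targetPathExists) {
        if (!targetPathExists) {
            return sourceDir;
        }
        Path parent = sourceDir.getParent();
        if (parent == null) {
            throw new IllegalArgumentException("sketch assumes a parent component: " + sourceDir);
        }
        return parent;
    }

    // Returns the relative path recorded for the source dir itself, or empty when
    // -update/-overwrite is given and the root entry is still skipped.
    static Optional<String> rootEntryRelativePath(Path sourceDir,
                                                  boolean targetPathExists,
                                                  boolean updateOrOverwrite) {
        if (updateOrOverwrite) {
            return Optional.empty();
        }
        Path root = sourceRoot(sourceDir, targetPathExists);
        return Optional.of(root.relativize(sourceDir).toString());
    }

    public static void main(String[] args) {
        Path source = Paths.get("a", "b");
        System.out.println(rootEntryRelativePath(source, false, false)); // Optional[]
        System.out.println(rootEntryRelativePath(source, true, false));  // Optional[b]
        System.out.println(rootEntryRelativePath(source, true, true));   // Optional.empty
    }
}
```

The same source dir yields a different relative path depending only on whether the
target exists, which is why any code that rebuilds the listing needs that information.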

3. CopyCommitter's deleteMissing method creates a DistCpOptions object with default settings
and collects a listing from the result of distcp prior to committing. This is not sufficient,
for the reason mentioned above (the result of SimpleCopyListing's computeSourceRootPath method
depends on DistCpOptions). The problem was revealed by the change I added to fix this JIRA,
and the patch I submitted addresses it.

Thanks for reviewing.

> distcp V2 doesn't preserve root dir's attributes when -p is specified
> ---------------------------------------------------------------------
>                 Key: HDFS-6152
>                 URL: https://issues.apache.org/jira/browse/HDFS-6152
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 2.3.0
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-6152.001.patch
> Two issues were observed with distcpV2
> ISSUE 1. when copying a source dir to target dir with "-pu" option using command 
>   "distcp -pu source-dir target-dir"
> The source dir's owner is not preserved at target dir. Similarly, other attributes of source
> dir are not preserved.  Supposedly they should be preserved when no -update and no -overwrite
> There are two scenarios with the above command:
> a. when target-dir already exists. Issuing the above command will result in target-dir/source-dir
> (source-dir here refers to the last component of the source-dir path in the command line)
> at target file system, with all contents in source-dir copied to under target-dir/source-dir.
> The issue in this case is that the attributes of source-dir are not preserved.
> b. when target-dir doesn't exist. It will result in target-dir with all contents of source-dir
> copied to under target-dir. The issue in this case is that the attributes of source-dir are not
> carried over to target-dir.
> For multiple source cases, e.g., command 
>   "distcp -pu source-dir1 source-dir2 target-dir"
> No matter whether the target-dir exists or not, the multiple sources are copied to under
> the target dir (target-dir is created if it didn't exist). And their attributes are preserved.

> ISSUE 2. with the following command:
>   "distcp source-dir target-dir"
> when source-dir is an empty directory, and when target-dir doesn't exist, source-dir
> is not copied; the command actually behaves like a no-op. However, when the source-dir is
> not empty, it is copied, resulting in target-dir at the target file system containing
> a copy of source-dir's children.
> To be consistent, an empty source dir should be copied too. Basically, the above distcp
> command should cause target-dir to be created at the target file system, with the source-dir's
> attributes preserved at target-dir when -p is passed.
