hadoop-common-issues mailing list archives

From "Rohit Pegallapati (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HADOOP-13023) Distcp with -update feature on first time raw data not working
Date Mon, 16 Apr 2018 02:36:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438921#comment-16438921 ]

Rohit Pegallapati edited comment on HADOOP-13023 at 4/16/18 2:35 AM:
---------------------------------------------------------------------

This looks in line with the intended behavior of the -update option:

[https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html]
{code:java}
-update is used to copy files from source that don’t exist at the target or differ from
the target version. -overwrite overwrites target-files that exist at the target.

The Update and Overwrite options warrant special attention since their handling of source-paths
varies from the defaults in a very subtle manner. Consider a copy from /source/first/ and /source/second/ to /target/,
where the source paths have the following contents:
hdfs://nn1:8020/source/first/1
hdfs://nn1:8020/source/first/2
hdfs://nn1:8020/source/second/10
hdfs://nn1:8020/source/second/20
When DistCp is invoked without -update or -overwrite, the DistCp defaults would create
directories first/ and second/, under /target. Thus:
distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
would yield the following contents in /target:
hdfs://nn2:8020/target/first/1
hdfs://nn2:8020/target/first/2
hdfs://nn2:8020/target/second/10
hdfs://nn2:8020/target/second/20
When either -update or -overwrite is specified, the *contents* of the source-directories
are copied to target, and not the source directories themselves. 
Thus:
distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target

would yield the following contents in /target:

hdfs://nn2:8020/target/1
hdfs://nn2:8020/target/2
hdfs://nn2:8020/target/10
hdfs://nn2:8020/target/20

{code}
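The path mapping quoted above can be sketched as a small toy model. This is only an illustration of the documented example, not DistCp's actual implementation: the class name DistCpPathSketch and the method targetPath are hypothetical, and the default branch covers only the documented case where the target already exists or several sources are given.

```java
// Toy model of the source-to-target path mapping described in the DistCp docs.
// DistCpPathSketch and targetPath are hypothetical names, not part of Hadoop.
final class DistCpPathSketch {

    // sourceRoot: a source directory given on the command line, e.g. "/source/first"
    // file:       a file below that directory, e.g. "/source/first/1"
    // targetRoot: the target directory, e.g. "/target"
    // update:     true models -update (or -overwrite), false models the default
    static String targetPath(String sourceRoot, String file, String targetRoot, boolean update) {
        // Path of the file relative to its source directory, e.g. "/1".
        String relative = file.substring(sourceRoot.length());
        if (update) {
            // Only the contents of the source directory are copied.
            return targetRoot + relative;
        }
        // Default: the source directory itself is recreated under the target.
        String dirName = sourceRoot.substring(sourceRoot.lastIndexOf('/') + 1);
        return targetRoot + "/" + dirName + relative;
    }

    public static void main(String[] args) {
        String[][] files = {
            {"/source/first", "/source/first/1"},
            {"/source/first", "/source/first/2"},
            {"/source/second", "/source/second/10"},
            {"/source/second", "/source/second/20"},
        };
        for (boolean update : new boolean[] {false, true}) {
            System.out.println(update ? "with -update:" : "default:");
            for (String[] f : files) {
                System.out.println("  " + targetPath(f[0], f[1], "/target", update));
            }
        }
    }
}
```

Running main lists the target paths for both modes, which should correspond to the two listings in the quoted documentation.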
I ran a small test with an encryption zone to validate the above point:
{code:java}
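// Note: 'source' and 'target' are not declared in this snippet; they are
// presumably Path fields of the enclosing test class, as are 'dfs' and 'conf'.
Path sourcePath = new Path(dfs.getWorkingDirectory(), "source");
initData10(sourcePath);

// Create an encryption zone on a subdirectory of the source, using key "test".
Path foo = new Path("/source/foo");
dfs.mkdirs(foo);
dfs.createEncryptionZone(foo, "test");

// Run DistCp with -update over the raw namespace.
String[] args = new String[] {
    "-update", "/.reserved/raw" + source.toString(), "/.reserved/raw" + target.toString()
};
new DistCp(conf, OptionsParser.parse(args)).execute();

// Print every encryption zone that exists after the copy.
RemoteIterator<EncryptionZone> listEncryptionZones = dfs.listEncryptionZones();
while (listEncryptionZones.hasNext()) {
    System.out.println("Encryption Zone :: " + listEncryptionZones.next().getPath());
}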
{code}
The code above prints two encryption zones, because the encryption zone is created on "foo",
a subdirectory of the source directory. The subdirectory's encryption zone is preserved
at the target:
{code:java}
Encryption Zone :: /source/foo
Encryption Zone :: /target/foo
{code}
By contrast, the code below prints only one encryption zone, because the zone is
created directly on the source directory rather than on a subdirectory:
{code:java}
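// Note: 'source' and 'target' are not declared in this snippet; they are
// presumably Path fields of the enclosing test class, as are 'dfs' and 'conf'.
Path sourcePath = new Path(dfs.getWorkingDirectory(), "source");
initData10(sourcePath);

// Create the encryption zone directly on the source directory, using key "test".
dfs.createEncryptionZone(source, "test");

// Run DistCp with -update over the raw namespace.
String[] args = new String[] {
    "-update", "/.reserved/raw" + source.toString(), "/.reserved/raw" + target.toString()
};
new DistCp(conf, OptionsParser.parse(args)).execute();

// Print every encryption zone that exists after the copy.
RemoteIterator<EncryptionZone> listEncryptionZones = dfs.listEncryptionZones();
while (listEncryptionZones.hasNext()) {
    System.out.println("Encryption Zone :: " + listEncryptionZones.next().getPath());
}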
{code}
{code:java}
Encryption Zone :: /source
{code}
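The two test results can be summarized with a toy predicate. It only encodes the reading given in this comment (with -update the source directory itself is not copied, only its contents) and is not how DistCp decides anything internally. The class ZoneCopySketch and the method zoneDirCopied are hypothetical names, and the default branch assumes the source directory itself is recreated at the destination.

```java
// Toy predicate for the observation above; not DistCp's actual mechanism.
// ZoneCopySketch and zoneDirCopied are hypothetical names, not part of Hadoop.
final class ZoneCopySketch {

    // Returns true if the directory that roots an encryption zone is itself
    // among the copied entries, so its zone could be recreated at the target.
    static boolean zoneDirCopied(String sourceRoot, String zoneDir, boolean update) {
        boolean isSourceRoot = zoneDir.equals(sourceRoot);
        boolean isBelowSourceRoot = zoneDir.startsWith(sourceRoot + "/");
        if (update) {
            // With -update only the contents of the source directory are copied.
            return isBelowSourceRoot;
        }
        // Assumed default: the source directory itself is copied as well.
        return isSourceRoot || isBelowSourceRoot;
    }

    public static void main(String[] args) {
        // Zone on a subdirectory of the source, copied with -update.
        System.out.println(zoneDirCopied("/source", "/source/foo", true));
        // Zone directly on the source directory, copied with -update.
        System.out.println(zoneDirCopied("/source", "/source", true));
    }
}
```

Under this reading, the case from the issue description (zone on /tmp/a/ted, copied without -update) falls into the default branch.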



> Distcp with -update feature on first time raw data not working
> --------------------------------------------------------------
>
>                 Key: HADOOP-13023
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13023
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools/distcp
>    Affects Versions: 2.6.0
>            Reporter: Mavin Martin
>            Priority: Major
>
> When attempting to do a distcp with the -update feature toggled on encrypted data, the
> distcp shows as successful.  Reading the encrypted file on the target_path does not work since
> the keyName does not exist.
> Please see my example to reproduce the issue.
> {code}
> [root@xxx bin]# hdfs crypto -listZones
> /tmp/a/ted                                DEF0000000000013
> [root@xxx bin]# hdfs dfs -ls -R /tmp
> drwxr-xr-x   - xxx xxx          0 2016-04-14 00:22 /tmp/a
> drwxr-xr-x   - xxx xxx          0 2016-04-14 00:00 /tmp/a/ted
> -rw-r--r--   3 xxx xxx         33 2016-04-14 00:00 /tmp/a/ted/test.txt
> [root@xxx bin]# hadoop distcp -update /.reserved/raw/tmp/a/ted /.reserved/raw/tmp/a-with-update/ted
> [root@xxx bin]# hdfs crypto -listZones
> /tmp/a/ted                                DEF0000000000013
> [root@xxx bin]# hadoop distcp /.reserved/raw/tmp/a/ted /.reserved/raw/tmp/a-no-update/ted
> [root@xxx bin]# hdfs crypto -listZones
> /tmp/a/ted                                DEF0000000000013
> /tmp/a-no-update/ted                      DEF0000000000013
> {code}
> The crypto zone for 'a-with-update' should have been created since this is a new destination.
> You can verify this by looking at 'a-no-update'.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

