hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joep Rottinghuis (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13002) distcp behaves differently through code compared to toolrunner invocation from command-line
Date Mon, 04 Apr 2016 23:12:25 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225238#comment-15225238
] 

Joep Rottinghuis commented on HADOOP-13002:
-------------------------------------------

The problem is that a new derived property is added to DistCpOptionSwitch 

Then in Distcp we have
{code}
  public int run(String[] argv) {
   ...
    try {
      inputOptions = (OptionsParser.parse(argv));
      setTargetPathExists();
      LOG.info("Input Options: " + inputOptions);
{code}
This doesn't get executed when calling distcp from a java program, because in that case a
deparate DistCpOptionSwitch gets created and DistCp is executed like so:
{code}
  Path targetPath = ...
  DistCpOptions options = new DistCpOptions(sourcePaths, targetPath);
  
  // Set desired options
  options.preserve(DistCpOptions.FileAttribute.BLOCKSIZE);
  options.preserve(DistCpOptions.FileAttribute.CHECKSUMTYPE);
  options.preserve(DistCpOptions.FileAttribute.REPLICATION);
  options.preserve(DistCpOptions.FileAttribute.TIMES);
  options.preserve(DistCpOptions.FileAttribute.USER);
  options.preserve(DistCpOptions.FileAttribute.GROUP);
  options.preserve(DistCpOptions.FileAttribute.PERMISSION);
  options.setBlocking(true);
  // -atomic defaults to false
  options.setAtomicCommit(false);
  // -update defaults to false
  options.setSyncFolder(false);
  
  Configuration conf = ...
  Job job = null;
  DistCp distCp = new DistCp(conf, options);
  job = distCp.execute();
{code}

in DistcpOptions targetPathExists defaults to true, the comment even suggests this:
{code}
  // targetPathExist is a derived field, it's initialized in the
  // beginning of distcp.
  private boolean targetPathExists = true;
{code}
The key problem is that it is initialized only from the run method, not in the execute method.

With the path not existing, but DistCp assuming it does exist when executed from code, the
following special handling fails to compute correctly in SimpleCopyListing.computeSourceRootPath
{code}
      boolean specialHandling =
          (options.getSourcePaths().size() == 1 && !targetPathExists)
              || options.shouldSyncFolder() || options.shouldOverwrite();
{code}
As a result the parent directory is prepended to the files in the generated sequence file
which results in the extra xyz/c directory in the target.

The least amount of code change would be to move the initialization in DistCP from run to
execute, however that has three disadvantages:
1) We'd also have to move the logging statement, otherwise an incorrectly initialized variable
is logged. That means that the invocation from Java suddenly does additional logging.
2) Having a derived method that could be explicitly set (to incorrect value) and then later
overwritten during execution is weird.
3) With hindsight we should have never introduced a setter for the derived property. It allows
the option and the associated property in the config to get out of sync. Moving the external
setting doesn't fix this.

I think the proper solution would be to remove the setTargetPathExists method from DistCp
and to move it to DistCpOptions and to invoke it from both constructors (but not the copy
constructor).
Now that the setTargetPathExists method was added in 2.5 we cannot remove it w/o breaking
backwards compatibility, but I think we should mark it as deprecated. We could also consider
making it a no-op given that the constructor already calculates it. Possibly we can make it
ignore the argument and re-calculate the exists value. I don't see the value in letting people
set an incorrect value (of path existing when it doesn't and visa versa).

However, the setting of the conf property has to remain in the distcp.execute method, because
the DistCpOptions class isn't a Configurable and will therefore not have access to getConf().
Passing the conf into the constructor of DistCpOptions would change its signature which we
cannot do.

> distcp behaves differently through code compared to toolrunner invocation from command-line
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13002
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13002
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools/distcp
>    Affects Versions: 2.5.0, 2.6.0, 2.7.0, 3.0.0
>            Reporter: Joep Rottinghuis
>
> In Hadoop 2.5 the behavior of distcp changed when called through code iff the target
directory did not exist and update wasn't used and atomic wasn't used.
> HADOOP-10459 introduced a change to preserve the root directory attributes. It introduced
a derivative property in the options as well as in the configuration whether the target path
exists. See https://github.com/apache/hadoop/commit/c5b59477775c797944db4992e8a70289ba2895ed
> However, this property is set only when distcp is used through the command line as a
ToolRunner in Distcp.run(String[] argv).
> The result is that when the target directory doesn't exist (and neither -update nor -atomic
options are used) SimplyCopyListing incorrectly assumes that the target directory does exist
because the attribute defaults to true. Copying directory a/b/c to xyz results in the creation
of a xyx/c directory with the content of c in it, rather than the content of c getting copied
into directory xyz directly.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message