hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mingliang Liu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13489) DistCp may incorrectly return success status when the underlying Job failed
Date Fri, 12 Aug 2016 22:43:20 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419635#comment-15419635
] 

Mingliang Liu commented on HADOOP-13489:
----------------------------------------

Thanks for reporting and working on this, [~tedyu].

The v1 patch checks the job not to be null value and job state to be DistCpConstants.SUCCEEDED
by directly calling {{isSuccessful()}}. My concerns are:
# Basically the {{DistCp#run()}} checks the status of the job by exception handling. All exceptions,
if not caught yet, will be caught by the last generic {{Exception}} block which returns DistCpConstants.UNKNOWN_ERROR.
Is it possible that the job is null while there is no exception? (I think here the job should
never be null)
#- If not, the {{if (job != null) {}} check is not needed, or better, state as an assert.
#- If true, I think for a null job, we should still return DistCpConstants.UNKNOWN_ERROR instead
of DistCpConstants.SUCCESS.
#  We have to respect the {{DistCpOptions#blocking}} option ({{-async}} from command line
parameters):
#- If the distcp option is blocking, {{DistCp#waitForJobCompletion()}} will be called, which
will throw an IOE in case the job is not successful.
{code}
  /**
   * Wait for the given job to complete.
   * @param job the given mapreduce job that has already been submitted
   */
  public void waitForJobCompletion(Job job) throws Exception {
    assert job != null;
    if (!job.waitForCompletion(true)) {
      throw new IOException("DistCp failure: Job " + job.getJobID()
          + " has failed: " + job.getStatus().getFailureInfo());
    }
  }
{code}
So if the Job fails, DistCpConstants.SUCCESS is NOT be returned?
#- If the distcp option is non-blocking, {{waitForJobCompletion()}} will not be called, in
which case the distcp just returns after the job is submitted. The internal job state can
still be RUNNING instead of SUCCEEDED. What should DistCp return in this case? I think we
can simply return SUCCESS, as the existing code does.

FWIW, the {{@return}} javadoc of the {{DistCp#run()}} method was wrong for years.

> DistCp may incorrectly return success status when the underlying Job failed
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-13489
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13489
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>         Attachments: HADOOP-13489.v1.patch, HADOOP-13489.v2.patch, TestIncrementalBackup-output.txt
>
>
> I was troubleshooting HBASE-14450 where at the end of BackupdistCp#execute(), distcp
job was marked unsuccessful (BackupdistCp is a wrapper of DistCp).
> Yet in IncrementalTableBackupProcedure#incrementalCopy(), the return value from copyService.copy()
was 0.
> Here is related code from DistCp:
> {code}
>     try {
>       execute();
>     } catch (InvalidInputException e) {
>       LOG.error("Invalid input: ", e);
>       return DistCpConstants.INVALID_ARGUMENT;
>     } catch (DuplicateFileException e) {
>       LOG.error("Duplicate files in input path: ", e);
>       return DistCpConstants.DUPLICATE_INPUT;
>     } catch (AclsNotSupportedException e) {
>       LOG.error("ACLs not supported on at least one file system: ", e);
>       return DistCpConstants.ACLS_NOT_SUPPORTED;
>     } catch (XAttrsNotSupportedException e) {
>       LOG.error("XAttrs not supported on at least one file system: ", e);
>       return DistCpConstants.XATTRS_NOT_SUPPORTED;
>     } catch (Exception e) {
>       LOG.error("Exception encountered ", e);
>       return DistCpConstants.UNKNOWN_ERROR;
>     }
>     return DistCpConstants.SUCCESS;
> {code}
> We don't check whether the Job returned by execute() was successful.
> Even if the Job fails, DistCpConstants.SUCCESS is returned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org


Mime
View raw message