hadoop-common-dev mailing list archives

From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3041) Within a task, the value of JobConf.getOutputPath() method is modified
Date Tue, 18 Mar 2008 13:45:30 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579846#action_12579846 ]

Devaraj Das commented on HADOOP-3041:

Alejandro, the reason for modifying the job's output dir is to let user apps transparently
deal with things like the creation of side files in the task's output directory and speculative
tasks creating the same output files. Another reason is that getOutputPath can be used
(and usually is used) in the OutputFormat implementation. All user code can call getOutputPath
and create task-specific files there, and the framework automatically promotes or discards these
files upon successful or failed task completion, respectively. Look at the JavaDoc for
JobConf.getOutputPath() for a clear explanation of what I am trying to say (by the way, this doc
needs to be fixed to mention _temporary).
You are facing the problem because you create a directory at the _same level_ as the _actual_
output directory of the job. One way to address your problem is to provide an additional API
like JobConf.getConfiguredOutputPath that would internally do things like getOutputPath().getParent(),
etc., and return the directory you actually configured. This would ensure that your apps don't
break when the framework changes the directory structure of the output path. It is not the
best solution, but we have to arrive at a compromise between your requirement and what we already
document and provide. Thoughts?
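
To make the proposal concrete, here is a minimal sketch of what such a helper might do, assuming the
task-level path has the /user/foo/myoutput/_temporary/part_00000 shape quoted in the issue below. The
class name ConfiguredOutputPath and the method configuredOutputPath are hypothetical and are not part
of Hadoop; the sketch uses java.nio.file.Path rather than Hadoop's own Path class so that it runs
without Hadoop on the classpath.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration only: not part of the Hadoop API.
public class ConfiguredOutputPath {

    // Name of the framework's temporary directory, as mentioned in this thread.
    static final String TEMP_DIR_NAME = "_temporary";

    // Walks up from the path seen inside a task until it finds the
    // "_temporary" component, then returns that component's parent,
    // i.e. the directory the client originally configured.
    // If no "_temporary" component is present, the input is returned unchanged.
    static Path configuredOutputPath(Path taskOutputPath) {
        for (Path p = taskOutputPath; p != null; p = p.getParent()) {
            Path name = p.getFileName();
            if (name != null && name.toString().equals(TEMP_DIR_NAME)) {
                return p.getParent();
            }
        }
        return taskOutputPath;
    }

    public static void main(String[] args) {
        // Example path taken from the issue description.
        Path taskPath = Paths.get("/user/foo/myoutput/_temporary/part_00000");
        System.out.println(configuredOutputPath(taskPath));
    }
}
```

An app that derives sibling directories would then start from the returned path instead of from
whatever getOutputPath() yields inside the task, so it would not depend on how many levels the
framework inserts below the configured directory.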

> Within a task, the value of JobConf.getOutputPath() method is modified
> ---------------------------------------------------------------------
>                 Key: HADOOP-3041
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3041
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.1
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Priority: Blocker
>             Fix For: 0.16.2
> Until 0.16.0 the value of the getOutputPath() method, if queried within a task, pointed
> to the part file assigned to the task.
> For example: /user/foo/myoutput/part_00000
> In 0.16.1, it now returns an internal Hadoop temporary location for the task output.
> For the above example: /user/foo/myoutput/_temporary/part_00000
> This change breaks applications that use getOutputPath() to compute other directories.
> IMO, this has always been broken: Hadoop should not change the values of properties
> injected by the client; instead, it should use private properties or internal helper methods.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
