hadoop-mapreduce-issues mailing list archives

From "Alejandro Abdelnur (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3772) MultipleOutputs output lost if baseOutputPath starts with ../
Date Wed, 19 Dec 2012 17:07:13 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536133#comment-13536133 ]

Alejandro Abdelnur commented on MAPREDUCE-3772:

MultipleOutputs was implemented to work properly when speculative execution is enabled with FileOutputFormat
implementations (Text, SequenceFile). To handle speculative execution, FileOutputFormats
write their output to the path *$mapred.out.dir/_temporary/_$taskid* during execution.
If speculative execution is in progress for a given task, there will be two task IDs for it,
which means that while the 'competing' tasks are running, their outputs go to different directories.
When the first speculative task completes, its output is committed (moved to *$mapred.out.dir*),
and the second speculative task is discarded along with its output. MultipleOutputs
creates the named outputs under the *_$taskid* directory, thus leveraging all the speculative
execution functionality and behavior implemented by FileOutputFormat. If the named output
file is not within the *_$taskid* directory, none of the logic just described works,
because the task commit procedure only moves files from the *_$taskid* directory to *$mapred.out.dir*.
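
To make the path arithmetic concrete, here is a minimal sketch that models the layout described above using only java.nio.file, not the Hadoop API. It assumes the named output path is resolved against the *_$taskid* directory; the class name, the method names and the task ID string are hypothetical and chosen for illustration only.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative model only: this is NOT Hadoop code. It mimics the directory
// layout $mapred.out.dir/_temporary/_$taskid with plain java.nio paths.
public class NamedOutputPathCheck {

    // Directory a task writes to while it is running (hypothetical helper).
    static Path attemptDir(String outDir, String taskId) {
        return Paths.get(outDir, "_temporary", "_" + taskId).normalize();
    }

    // Where a named output lands, assuming it is resolved against the
    // attempt directory. An absolute path replaces the attempt directory.
    static Path resolveNamedOutput(String outDir, String taskId, String namedOutputPath) {
        return attemptDir(outDir, taskId).resolve(namedOutputPath).normalize();
    }

    // True only if the named output stays under the attempt directory,
    // which is the only place the task commit procedure moves files from.
    static boolean isCoveredByTaskCommit(String outDir, String taskId, String namedOutputPath) {
        Path attempt = attemptDir(outDir, taskId);
        return resolveNamedOutput(outDir, taskId, namedOutputPath).startsWith(attempt);
    }

    public static void main(String[] args) {
        String outDir = "/tmp/multi1/out";
        String taskId = "task_0001"; // hypothetical task ID
        for (String p : new String[] {"list1", "../extra/", "/tmp/multi1/extra/"}) {
            System.out.println(p + " -> " + resolveNamedOutput(outDir, taskId, p)
                    + " (covered by task commit: "
                    + isCoveredByTaskCommit(outDir, taskId, p) + ")");
        }
    }
}
```

In this model, a plain name such as list1 stays inside the *_$taskid* directory, while both ../extra/ and an absolute path fall outside it and so are never touched by the task commit.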

Because of this, I think that Priyo's suggestion of logging a warning makes sense. There is a caveat
to this: when using the MO.write(K,V,NamedOutputPath) method, the warning would be logged in the log
of the task that is creating the named output with an absolute path.

> MultipleOutputs output lost if baseOutputPath starts with ../
> -------------------------------------------------------------
>                 Key: MAPREDUCE-3772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 0.20.2
>            Reporter: Radim Kolar
>            Assignee: Harsh J
>         Attachments: MAPREDUCE-3772.patch
> Let's say you have the output directory set:
> FileOutputFormat.setOutputPath(job, new Path("/tmp/multi1/out"));
> and want to place output from MultipleOutputs into /tmp/multi1/extra
> I expect the following code to work:
> mos = new MultipleOutputs<Text, IntWritable>(context);
> mos.write(new Text("zrr"), value, "../extra/");
> but no Exception is thrown and the expected output directory /tmp/multi1/extra does not even
> exist. All data written to this output vanishes without a trace.
> To make it work, a full path must be used:
> mos.write(new Text("zrr"), value, "/tmp/multi1/extra/");
> Output is listed in statistics from MultipleOutputs correctly:
>         org.apache.hadoop.mapreduce.lib.output.MultipleOutputs
>                 ../gaja1/=13333 (* everything is lost *)
>                 /tmp/multi1/out/../ksd34/=13333 (* this using full path works *)
>                 list1=6667

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
