hadoop-mapreduce-issues mailing list archives

From "Priyo Mustafi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3772) MultipleOutputs output lost if baseOutputPath starts with ../
Date Wed, 21 Nov 2012 20:11:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502267#comment-13502267 ]

Priyo Mustafi commented on MAPREDUCE-3772:

We are seeing the same issue in 0.20.203, with partial data loss.  We write hundreds of GB
and see a 7-10% loss in the written records compared to what the MultipleOutputs counters
report.  Sometimes we also see zero-length SequenceFiles, which are invalid.  That is how
we first noticed the problem: our jobs would sometimes get an EOFException while reading
a SequenceFile.

The issue happens when:
1) MultipleOutputs writes data outside the main output directory of the reducer, i.e. by
giving an absolute path that is not inside the main output directory.  This matches the
above comment by Radim, since his ../ moves the MultipleOutputs directory outside the
main output directory.

2) Speculative execution is turned on for the reducer.  Without speculative execution,
everything works whether the directory is inside or outside the main output directory.
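
To illustrate condition 1, here is a minimal sketch of how a baseOutputPath might resolve
relative to a task attempt's work directory.  It is not Hadoop code.  It assumes the usual
committer layout in which each attempt writes under its own temporary directory below the
job output directory and only that directory is promoted on commit.  The directory name
WORK_DIR and the class OutputPathSketch are hypothetical, and this explanation is not
confirmed anywhere in this thread.

```java
import java.net.URI;

public class OutputPathSketch {
    // Hypothetical attempt work directory; the real name is generated by the framework.
    static final String WORK_DIR = "/tmp/multi1/out/_temporary/_attempt_0/";

    /** Resolve a baseOutputPath against the attempt work directory, URI-style. */
    static String resolve(String baseOutputPath) {
        return URI.create(WORK_DIR).resolve(baseOutputPath).getPath();
    }

    /** True if the resolved file stays inside the attempt work directory. */
    static boolean insideWorkDir(String baseOutputPath) {
        return resolve(baseOutputPath).startsWith(WORK_DIR);
    }

    public static void main(String[] args) {
        String[] candidates = {"list1", "../extra/part", "/tmp/multi1/extra/part"};
        for (String p : candidates) {
            System.out.println(p + " -> " + resolve(p)
                    + " insideWorkDir=" + insideWorkDir(p));
        }
    }
}
```

Under these assumptions, only the plain name "list1" stays inside the attempt directory.
A path starting with ../ and an absolute path outside the output directory both escape it,
so neither would be covered by the per-attempt commit step, which would be consistent with
the behaviour reported here.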

By the way, my code used MultipleOutputs only in the reducer, so I am not sure whether the
same problem exists for the mapper.

> MultipleOutputs output lost if baseOutputPath starts with ../
> -------------------------------------------------------------
>                 Key: MAPREDUCE-3772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv1
>    Affects Versions: 0.22.0
>         Environment: FreeBSD
>            Reporter: Radim Kolar
>            Priority: Minor
> Let's say you have the output directory set:
> FileOutputFormat.setOutputPath(job, new Path("/tmp/multi1/out"));
> and want to place output from MultipleOutputs into /tmp/multi1/extra.
> I expect the following code to work:
> mos = new MultipleOutputs<Text, IntWritable>(context);
> mos.write(new Text("zrr"), value, "../extra/");
> but no exception is thrown and the expected output directory /tmp/multi1/extra does not
> even exist. All data written to this output vanishes without a trace.
> To make it work, the full path must be used:
> mos.write(new Text("zrr"), value, "/tmp/multi1/extra/");
> The output is listed correctly in the statistics from MultipleOutputs:
>         org.apache.hadoop.mapreduce.lib.output.MultipleOutputs
>                 ../gaja1/=13333 (* everything is lost *)
>                 /tmp/multi1/out/../ksd34/=13333 (* this using full path works *)
>                 list1=6667
