hadoop-mapreduce-issues mailing list archives

From "Priyo Mustafi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3772) MultipleOutputs output lost if baseOutputPath starts with ../
Date Thu, 29 Nov 2012 01:09:58 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506107#comment-13506107 ]

Priyo Mustafi commented on MAPREDUCE-3772:
------------------------------------------

MultipleOutputs exposes two methods:
  1) public <K,V> void write(String namedOutput,K key,V value)
  2) public <K,V> void write(String namedOutput,K key,V value,String baseOutputPath)
where
  namedOutput - the named output name
  baseOutputPath - base-output path to write the record to. Note: Framework will generate
                   unique filename for the baseOutputPath
  
We use the second one, which lets you provide a baseOutputPath where the data should be
written.  Nothing in the Javadoc says that baseOutputPath must not be a fully qualified
path, so the Jira is definitely valid.  Either the Javadoc or the code needs to be fixed.
I would prefer the latter, as we have developed extensive data pipelines that rely on this.
If it is not fixed, we will have to change the absolute paths to sub-directory paths and,
once the job is done, move all those directories out to the expected locations.
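
The workaround described above (write to sub-directories of the job output directory, then
move them once the job is done) could look roughly like the following. This is only a sketch
on the local filesystem using java.nio.file; the class StagedOutputMover, the method
moveStaged and the directory name "staged_extra" are hypothetical names chosen for
illustration. On HDFS the move step would presumably go through Hadoop's
FileSystem.rename(Path, Path) instead of Files.move.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

final class StagedOutputMover {
    // For each entry, moves <jobOutputDir>/<staged name> to the final location.
    // Files.move renames a non-empty directory only when source and target are
    // on the same file store, and it fails if the target already exists.
    static void moveStaged(Path jobOutputDir, Map<String, Path> stagedToFinal) throws IOException {
        for (Map.Entry<String, Path> entry : stagedToFinal.entrySet()) {
            Path staged = jobOutputDir.resolve(entry.getKey());
            Path target = entry.getValue();
            if (!Files.isDirectory(staged)) {
                continue; // the job wrote nothing under this sub-directory
            }
            if (target.getParent() != null) {
                Files.createDirectories(target.getParent());
            }
            Files.move(staged, target);
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a finished job that wrote into <root>/out/staged_extra.
        Path root = Files.createTempDirectory("multi1");
        Path out = Files.createDirectories(root.resolve("out"));
        Path staged = Files.createDirectories(out.resolve("staged_extra"));
        Files.write(staged.resolve("xyz-r-00000"), "zrr\t1\n".getBytes(StandardCharsets.UTF_8));

        Map<String, Path> plan = new LinkedHashMap<>();
        plan.put("staged_extra", root.resolve("extra"));
        moveStaged(out, plan);

        System.out.println(Files.exists(root.resolve("extra").resolve("xyz-r-00000")));
    }
}
```

The extra move step is exactly the overhead the comment wants to avoid, which is why a fix in
the code is preferred.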

Aside from that, if we provide baseOutputPath as "abc/def/xyz", the directory is created under
the main output directory, i.e. you get files like <main-output-dir>/abc/def/xyz-r-00000.
If instead you use baseOutputPath as "/abc/def/xyz", where the path isn't a subdirectory
of the main output directory, then the problem is seen.
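
To make the three kinds of baseOutputPath discussed in this issue concrete, here is a small
sketch that resolves a baseOutputPath against the job output directory and checks whether
the result still lies under it. It uses plain java.net.URI resolution as a stand-in and is
not MultipleOutputs' actual code; the class BaseOutputPathCheck and its methods are
hypothetical names, and the assumption is that the intended semantics are ordinary
relative-path resolution.

```java
import java.net.URI;

final class BaseOutputPathCheck {
    // Ensure the directory ends with "/" so relative references resolve inside it.
    private static String asDirectory(String outputDir) {
        return outputDir.endsWith("/") ? outputDir : outputDir + "/";
    }

    // Resolves baseOutputPath against outputDir like a relative URI reference;
    // "../" segments are normalized away and absolute paths replace the base.
    static String resolve(String outputDir, String baseOutputPath) {
        return URI.create(asDirectory(outputDir)).resolve(baseOutputPath).getPath();
    }

    // True when the resolved path is still located under the job output directory.
    static boolean staysUnderOutputDir(String outputDir, String baseOutputPath) {
        return resolve(outputDir, baseOutputPath).startsWith(asDirectory(outputDir));
    }

    public static void main(String[] args) {
        String out = "/tmp/multi1/out";
        for (String p : new String[] {"abc/def/xyz", "/abc/def/xyz", "../extra/"}) {
            System.out.println(p + " -> " + resolve(out, p)
                    + " (under output dir: " + staysUnderOutputDir(out, p) + ")");
        }
    }
}
```

With the output directory /tmp/multi1/out, only "abc/def/xyz" stays under the output
directory; "/abc/def/xyz" and "../extra/" both resolve to locations outside it, which are the
cases reported as problematic in this issue.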



                
> MultipleOutputs output lost if baseOutputPath starts with ../
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-3772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv1
>    Affects Versions: 0.20.203.0, 0.22.0
>         Environment: FreeBSD
>            Reporter: Radim Kolar
>
> Let's say you have the output directory set:
> FileOutputFormat.setOutputPath(job, new Path("/tmp/multi1/out"));
> and want to place output from MultipleOutputs into /tmp/multi1/extra.
> I expect the following code to work:
> mos = new MultipleOutputs<Text, IntWritable>(context);
> mos.write(new Text("zrr"), value, "../extra/");
> but no Exception is thrown and the expected output directory /tmp/multi1/extra does not
> even exist. All data written to this output vanishes without a trace.
> To make it work, the full path must be used:
> mos.write(new Text("zrr"), value, "/tmp/multi1/extra/");
> Output is listed in statistics from MultipleOutputs correctly:
>         org.apache.hadoop.mapreduce.lib.output.MultipleOutputs
>                 ../gaja1/=13333 (* everything is lost *)
>                 /tmp/multi1/out/../ksd34/=13333 (* this using full path works *)
>                 list1=6667

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
