hadoop-hdfs-user mailing list archives

From Ashish Paliwal <ashishpaliwa...@gmail.com>
Subject Re: Hadoop MultiOutputs API Issue
Date Fri, 23 Dec 2016 09:45:27 GMT
Please share your comments on the issue described below.


On Wed, Dec 21, 2016 at 6:28 PM, Ashish Paliwal <ashishpaliwal83@gmail.com> wrote:

> Hi,
> Hadoop MapReduce version: 2.2.0
> We are using MultipleOutputs to write multiple output files from a Mapper (no
> reducer). Per our requirements, MultipleOutputs should write to a directory
> other than the job's default output directory, so we used the MultipleOutputs
> method below to write to a different directory.
>  public <K, V> void write(String namedOutput, K key, V value, String baseOutputPath)
> Now, if any map task runs for a long time, then (because speculative
> execution is enabled) Hadoop starts a parallel attempt to finish the task
> sooner. Both attempts then try to write to the same file in the same
> directory. The second attempt fails with a "File already exists" error, and
> so does the job.
> After analyzing this, we found that, unlike the default context writer, *the
> MultipleOutputs API does not create any temporary directory*. It starts
> writing directly into the output directory. The reason is that the
> FileOutputCommitter used by the default context writer (and so by the
> Application Master) is different from the writer used by MultipleOutputs.
> So with MultipleOutputs, none of the FileOutputCommitter methods get called.
> So is this a known issue or the default behavior? And what is the solution
> to this problem?
> Regards,
> Ashish.
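
The collision described above can be illustrated without Hadoop. The following is a minimal sketch in plain Java (java.nio only), not Hadoop code: the class AttemptCommitSketch, its method names, and the "_temporary/<attemptId>" layout are hypothetical and only mimic the general idea of a per-attempt temporary directory followed by a commit step. It first shows two attempts writing directly to the same file, then shows each attempt writing to its own directory with only one of them promoted to the final location.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; none of these names come from Hadoop.
public class AttemptCommitSketch {

    // Direct write: create the final file, failing if it already exists.
    public static void writeDirect(Path finalFile, String data) throws IOException {
        Files.createDirectories(finalFile.getParent());
        Files.write(finalFile, data.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
    }

    // Attempt-scoped write: each attempt gets its own directory under the
    // output directory, so two attempts never touch the same path.
    public static Path writeToAttemptDir(Path outputDir, String attemptId,
                                         String fileName, String data) throws IOException {
        Path attemptDir = outputDir.resolve("_temporary").resolve(attemptId);
        Files.createDirectories(attemptDir);
        Path tmp = attemptDir.resolve(fileName);
        Files.write(tmp, data.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
        return tmp;
    }

    // Commit: move the attempt's file into the output directory. Returns
    // false (and discards the attempt's file) if another attempt got there first.
    public static boolean commit(Path tmp, Path outputDir) throws IOException {
        Path target = outputDir.resolve(tmp.getFileName());
        try {
            Files.move(tmp, target); // no REPLACE_EXISTING: throws if target exists
            return true;
        } catch (FileAlreadyExistsException e) {
            Files.delete(tmp);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("multi-out");

        // Case 1: both attempts write straight to the final file.
        Path direct = out.resolve("direct").resolve("part-m-00000");
        writeDirect(direct, "attempt 0\n");
        try {
            writeDirect(direct, "attempt 1\n");
        } catch (FileAlreadyExistsException e) {
            System.out.println("second attempt failed: file already exists");
        }

        // Case 2: each attempt writes to its own directory, then commits.
        Path committed = out.resolve("committed");
        Files.createDirectories(committed);
        Path a0 = writeToAttemptDir(committed, "attempt_0", "part-m-00000", "attempt 0\n");
        Path a1 = writeToAttemptDir(committed, "attempt_1", "part-m-00000", "attempt 1\n");
        System.out.println("attempt_0 committed: " + commit(a0, committed));
        System.out.println("attempt_1 committed: " + commit(a1, committed));
    }
}
```

In this sketch the losing attempt is simply discarded at commit time. In a real job, which attempt is allowed to commit is decided by the framework rather than by a race on the move, so this should be read as an analogy for the behavior described in the message, not as a description of FileOutputCommitter internals.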
