hadoop-mapreduce-user mailing list archives

From Hari Sreekumar <hsreeku...@clickable.com>
Subject Re: Using MultipleTextOutputFormat for map-only jobs
Date Fri, 15 Apr 2011 05:11:15 GMT
I changed jobConf.setMapOutputKeyClass(Text.class); to
jobConf.setMapOutputKeyClass(NullWritable.class);

Still no luck.

I also get this error in many mappers:

java.io.IOException: Failed to delete earlier output of task: attempt_201104041514_0069_m_000003_0
	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:110)
	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
	at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
	at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
	at org.apache.hadoop.mapred.Task.commit(Task.java:779)
	at org.apache.hadoop.mapred.Task.done(Task.java:691)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)
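
One possible reading of both stack traces, offered as a guess rather than a confirmed diagnosis: when an input file is divided into several splits, every map task working on that file produces an output file with the same name, so the tasks collide when FileOutputCommitter moves their output into the final directory. A minimal sketch of one workaround is to keep the input file name but append the per-task leaf name (the `name` argument, which is "part-NNNNN" by default). The class and method names below are hypothetical, and the logic is written as plain Java without Hadoop classes so it can be run on its own; in the real override, inputFilePath would come from job.get("map.input.file").

```java
// Sketch only: pure-Java naming logic, no Hadoop classes required.
// UniqueOutputNames and outputName are hypothetical names for illustration.
public class UniqueOutputNames {

    /**
     * Builds an output file name that is unique per map task.
     *
     * @param inputFilePath value of job.get("map.input.file"); may be null
     * @param leafName      default per-task name passed in as `name`,
     *                      for example "part-00003"
     */
    public static String outputName(String inputFilePath, String leafName) {
        if (inputFilePath == null || inputFilePath.isEmpty()) {
            // No input file known: fall back to the default task name.
            return leafName;
        }
        // Keep only the last path component (the file name).
        String base = inputFilePath.substring(inputFilePath.lastIndexOf('/') + 1);
        if (base.isEmpty()) {
            return leafName;
        }
        // Appending the leaf name keeps tasks that read different splits
        // of the same input file from writing to the same output path.
        return base + "-" + leafName;
    }

    public static void main(String[] args) {
        // prints clicks.log-part-00003
        System.out.println(outputName("hdfs://namenode/input/clicks.log", "part-00003"));
        // prints part-00003
        System.out.println(outputName(null, "part-00003"));
    }
}
```

With this scheme a single input file can yield several output files (one per split), which would have to be concatenated afterwards if exactly one output file per input file is required.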


On Fri, Apr 15, 2011 at 10:37 AM, Hari Sreekumar
<hsreekumar@clickable.com> wrote:

> Here's what I tried:
>
>   static class MapperClass extends MapReduceBase implements
>           Mapper<LongWritable, Text, NullWritable, Text> {
>     @Override
>     public void map(LongWritable key, Text value,
>             OutputCollector<NullWritable, Text> output, Reporter reporter)
>             throws IOException {
>       output.collect(
>               NullWritable.get(),
>               value);
>     }
>   }
>
>   static class SameFilenameOutputFormat extends
>           MultipleTextOutputFormat<NullWritable, Text> {
>
>     @Override
>     protected String getInputFileBasedOutputFileName(JobConf job, String name) {
>       String infilepath = job.get("map.input.file");
>       System.out.println("File path: " + infilepath);
>       if (infilepath == null) {
>         return name;
>       }
>       return new Path(infilepath).getName();
>     }
>   }
>
> And the config I set in the run() method:
>     JobConf jobConf = new JobConf(conf, this.getClass());
>
>     jobConf.setMapperClass(MapperClass.class);
>     jobConf.setNumReduceTasks(0);
>     jobConf.setMapOutputKeyClass(Text.class);
>     jobConf.setMapOutputValueClass(Text.class);
>     jobConf.setOutputKeyClass(NullWritable.class);
>     jobConf.setOutputValueClass(Text.class);
>     jobConf.setOutputFormat(SameFilenameOutputFormat.class);
>
> I do get output files with same names as input files, but I lose a lot of
> records. I get this exception and many tasks fail:
>
> 2011-04-15 10:23:53,090 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=MAP, sessionId= - already initialized
> 2011-04-15 10:23:53,139 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
> 2011-04-15 10:23:53,171 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
> 2011-04-15 10:23:53,174 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2011-04-15 10:24:01,829 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201104041514_0068_m_000001_0 is done. And is in the process of commiting
> 2011-04-15 10:24:04,842 INFO org.apache.hadoop.mapred.TaskRunner: Task attempt_201104041514_0068_m_000001_0 is allowed to commit now
> 2011-04-15 10:24:05,405 WARN org.apache.hadoop.mapred.TaskRunner: Failure committing: java.io.IOException: Failed to save output of task: attempt_201104041514_0068_m_000001_0
> 	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:114)
> 	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
> 	at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
> 	at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
> 	at org.apache.hadoop.mapred.Task.commit(Task.java:779)
> 	at org.apache.hadoop.mapred.Task.done(Task.java:691)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> 2011-04-15 10:24:11,846 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.io.IOException: Failed to save output of task: attempt_201104041514_0068_m_000001_0
> 	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:114)
> 	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
> 	at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
> 	at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
> 	at org.apache.hadoop.mapred.Task.commit(Task.java:779)
> 	at org.apache.hadoop.mapred.Task.done(Task.java:691)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:170)
> 2011-04-15 10:24:11,863 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
>
> I guess it has something to do with partitioning? Maybe the mappers are not
> simultaneously able to write to the same file or something of that sort?
>
> Thanks,
> Hari
>
> On Thu, Apr 14, 2011 at 6:37 PM, Hari Sreekumar <hsreekumar@clickable.com> wrote:
>
>> That is exactly what I do when I have a reduce phase, and it works. But in
>> case of map-only jobs, it doesn't work. I'll try overriding the
>> getOutputfileFromInputFile() method.
>>
>>
>> On Thu, Apr 14, 2011 at 5:19 PM, Harsh J <harsh@cloudera.com> wrote:
>>
>>> Hello again Hari,
>>>
>>> On Thu, Apr 14, 2011 at 5:10 PM, Hari Sreekumar
>>> <hsreekumar@clickable.com> wrote:
>>> > Here is a part of the code I am using:
>>> >     jobConf.setOutputFormat(MultipleTextOutputFormat.class);
>>>
>>> You need to subclass the OF and use it properly; otherwise the abstract
>>> class takes over and the default name is always used (thus, 'part'). You
>>> can see a good, complete example at [1].
>>>
>>> I'd still recommend using MultipleOutputs for better portability.
>>> Its javadocs explain how to use it well enough
>>> [2].
>>>
>>> [1] -
>>> https://sites.google.com/site/hadoopandhive/home/how-to-write-output-to-multiple-named-files-in-hadoop-using-multipletextoutputformat
>>> [2] -
>>> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>
