hadoop-mapreduce-user mailing list archives

From Hari Sreekumar <hsreeku...@clickable.com>
Subject Re: Using MultipleTextOutputFormat for map-only jobs
Date Fri, 15 Apr 2011 05:07:45 GMT
Here's what I tried:

  // Identity mapper: emit each input line unchanged, keyed by NullWritable
  // so only the text itself reaches the output files.
  static class MapperClass extends MapReduceBase implements
          Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<NullWritable, Text> output, Reporter reporter)
            throws IOException {
      output.collect(NullWritable.get(), value);
    }
  }

  // Name each output file after the input file it came from.
  // "map.input.file" is set per split for file-based input formats; it can
  // be unset (e.g. outside a map task), hence the fallback to the default name.
  static class SameFilenameOutputFormat extends
          MultipleTextOutputFormat<NullWritable, Text> {

    @Override
    protected String getInputFileBasedOutputFileName(JobConf job, String name) {
      String infilepath = job.get("map.input.file");
      System.out.println("File path: " + infilepath);
      if (infilepath == null) {
        return name;
      }
      return new Path(infilepath).getName();
    }
  }


And the config I set in the run() method:

    JobConf jobConf = new JobConf(conf, this.getClass());

    jobConf.setMapperClass(MapperClass.class);
    jobConf.setNumReduceTasks(0);
    // Map output classes must match the mapper's declared output types;
    // in a map-only job these are also the job's final output types.
    jobConf.setMapOutputKeyClass(NullWritable.class);
    jobConf.setMapOutputValueClass(Text.class);
    jobConf.setOutputKeyClass(NullWritable.class);
    jobConf.setOutputValueClass(Text.class);
    jobConf.setOutputFormat(SameFilenameOutputFormat.class);
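
The rest of run() is the usual path setup and submission (sketched from
memory; the args indices are assumptions, not verbatim from my job):

    // Assumed boilerplate: input/output paths and submission (old mapred API).
    FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
    FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
    JobClient.runJob(jobConf);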

I do get output files with the same names as the input files, but I lose a
lot of records. Many tasks fail with this exception:

2011-04-15 10:23:53,090 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Cannot initialize JVM Metrics with processName=MAP, sessionId= -
already initialized
2011-04-15 10:23:53,139 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2011-04-15 10:23:53,171 INFO org.apache.hadoop.util.NativeCodeLoader:
Loaded the native-hadoop library
2011-04-15 10:23:53,174 INFO
org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded &
initialized native-zlib library
2011-04-15 10:24:01,829 INFO org.apache.hadoop.mapred.TaskRunner:
Task:attempt_201104041514_0068_m_000001_0 is done. And is in the
process of commiting
2011-04-15 10:24:04,842 INFO org.apache.hadoop.mapred.TaskRunner: Task
attempt_201104041514_0068_m_000001_0 is allowed to commit now
2011-04-15 10:24:05,405 WARN org.apache.hadoop.mapred.TaskRunner:
Failure committing: java.io.IOException: Failed to save output of
task: attempt_201104041514_0068_m_000001_0
	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:114)
	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
	at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
	at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
	at org.apache.hadoop.mapred.Task.commit(Task.java:779)
	at org.apache.hadoop.mapred.Task.done(Task.java:691)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)

2011-04-15 10:24:11,846 WARN org.apache.hadoop.mapred.TaskTracker:
Error running child
java.io.IOException: Failed to save output of task:
attempt_201104041514_0068_m_000001_0
	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:114)
	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
	at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
	at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
	at org.apache.hadoop.mapred.Task.commit(Task.java:779)
	at org.apache.hadoop.mapred.Task.done(Task.java:691)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)
2011-04-15 10:24:11,863 INFO org.apache.hadoop.mapred.TaskRunner:
Runnning cleanup for the task

I guess it has something to do with how the input is split? When a single
input file is large enough to be split across several map tasks, each of
those tasks would produce an output file with the same name, so maybe only
the first one manages to commit and the rest fail in FileOutputCommitter?
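
If that is the cause, one workaround (a sketch, untested; the per-task
suffix is my own idea, not something MultipleTextOutputFormat provides)
would be to make every task's file name unique by appending the task
partition:

  // Sketch (untested): append the task partition so two map tasks reading
  // splits of the same input file never try to commit the same file name.
  static class UniqueFilenameOutputFormat extends
          MultipleTextOutputFormat<NullWritable, Text> {

    @Override
    protected String getInputFileBasedOutputFileName(JobConf job, String name) {
      String infilepath = job.get("map.input.file");
      if (infilepath == null) {
        return name;
      }
      // "mapred.task.partition" holds this task's index within the job.
      int partition = job.getInt("mapred.task.partition", 0);
      return new Path(infilepath).getName() + "-" + partition;
    }
  }

Alternatively, to keep the original file names exactly, forcing one map
task per input file by making the input non-splittable should avoid the
collision in the first place:

  // Sketch: one mapper per file, so each output file name stays unique.
  static class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }

  // In run():
  //   jobConf.setInputFormat(NonSplittableTextInputFormat.class);

(Large files would then be read by a single mapper, so this trades
parallelism for the naming guarantee.)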

Thanks,
Hari

On Thu, Apr 14, 2011 at 6:37 PM, Hari Sreekumar <hsreekumar@clickable.com> wrote:

> That is exactly what I do when I have a reduce phase, and it works. But in
> the case of map-only jobs, it doesn't. I'll try overriding the
> getInputFileBasedOutputFileName() method.
>
>
> On Thu, Apr 14, 2011 at 5:19 PM, Harsh J <harsh@cloudera.com> wrote:
>
>> Hello again Hari,
>>
>> On Thu, Apr 14, 2011 at 5:10 PM, Hari Sreekumar
>> <hsreekumar@clickable.com> wrote:
>> > Here is a part of the code I am using:
>> >     jobConf.setOutputFormat(MultipleTextOutputFormat.class);
>>
>> You need to subclass the output format and override its file naming;
>> otherwise the abstract class falls back to the default name ('part').
>> You can see a good, complete example at [1].
>>
>> I'd still recommend using MultipleOutputs, for portability reasons.
>> Its javadocs [2] explain well enough how to go about using it.
>>
>> [1] -
>> https://sites.google.com/site/hadoopandhive/home/how-to-write-output-to-multiple-named-files-in-hadoop-using-multipletextoutputformat
>> [2] -
>> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
>>
>> --
>> Harsh J
>>
>
>
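
For reference, a minimal sketch of the MultipleOutputs route suggested
above (old mapred API; untested). One caveat: named outputs are declared
up front and, if I remember the javadocs [2] right, their names may only
contain letters and numbers, so they cannot carry arbitrary input file
names directly. "lines" below is an arbitrary example name:

  // In run(), declare the named output before submitting the job:
  //   MultipleOutputs.addNamedOutput(jobConf, "lines",
  //       TextOutputFormat.class, NullWritable.class, Text.class);

  static class MultipleOutputsMapper extends MapReduceBase implements
          Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs mos;

    @Override
    public void configure(JobConf job) {
      mos = new MultipleOutputs(job);
    }

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<NullWritable, Text> output, Reporter reporter)
            throws IOException {
      // Write to the named output "lines" instead of the default output.
      mos.getCollector("lines", reporter).collect(NullWritable.get(), value);
    }

    @Override
    public void close() throws IOException {
      mos.close();
    }
  }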
