hadoop-mapreduce-user mailing list archives

From Hari Sreekumar <hsreeku...@clickable.com>
Subject Re: Using MultipleTextOutputFormat for map-only jobs
Date Thu, 14 Apr 2011 11:40:27 GMT
Here is a part of the code I am using:

static class mapperClass extends MapReduceBase implements
          Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<NullWritable, Text> output, Reporter reporter)
            throws IOException {
      // Identity map: drop the offset key and emit each line unchanged.
      output.collect(NullWritable.get(), value);
    }
  }

...
...
@Override
  public int run(String[] args) throws Exception {

    Configuration conf = new Configuration();

    Path[] inputPaths = new Path[args.length - 1];
    for (int i = 0; i < args.length - 1; ++i) {
      inputPaths[i] = new Path(args[i]);
    }

    String outputPath = args[args.length - 1].trim();

    JobConf jobConf = new JobConf(conf, this.getClass());

    jobConf.setMapperClass(mapperClass.class);
    jobConf.setNumReduceTasks(0);
    jobConf.setMapOutputKeyClass(NullWritable.class);
    jobConf.setMapOutputValueClass(Text.class);
    jobConf.setOutputKeyClass(NullWritable.class);
    jobConf.setOutputValueClass(Text.class);
    jobConf.setOutputFormat(MultipleTextOutputFormat.class);
    jobConf.setBoolean("mapred.output.compress", true);
    jobConf.setClass("mapred.output.compression.codec",
            GzipCodec.class, CompressionCodec.class);
    FileInputFormat.setInputPaths(jobConf, inputPaths);
    FileOutputFormat.setOutputPath(jobConf, new Path(outputPath));

    JobClient.runJob(jobConf);
    return 0;
  }
  public static void main(String[] args) throws Exception {
    int returnValue = ToolRunner.run(new MapReduceClass(), args);
    System.exit(returnValue);
  }
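
In case it helps, what I eventually want to plug in instead of the raw
MultipleTextOutputFormat is a small subclass that overrides
generateFileNameForKeyValue, so the file name is derived from the value.
A rough sketch (OutputByValueFormat and the tab-split logic are just
placeholders, not code I have running):

  static class OutputByValueFormat
          extends MultipleTextOutputFormat<NullWritable, Text> {
    @Override
    protected String generateFileNameForKeyValue(NullWritable key, Text value,
            String name) {
      // Placeholder: route each record to a directory named after its first field.
      return value.toString().split("\t")[0] + "/" + name;
    }
  }

The driver would then call jobConf.setOutputFormat(OutputByValueFormat.class)
in place of MultipleTextOutputFormat.class.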

Thanks,
Hari

On Thu, Apr 14, 2011 at 1:22 PM, Harsh J <harsh@cloudera.com> wrote:

> Hello Hari,
>
> On Thu, Apr 14, 2011 at 11:09 AM, Hari Sreekumar
> <hsreekumar@clickable.com> wrote:
> > Hi,
> > I have a map-only mapreduce job where I want to deduce the output
> > filename from the output key/value. I figured MultipleTextOutputFormat
> > is the best fit for my purpose. But I am unable to use it in map-only
> > jobs. I was able to run it if I add a reduce phase. But when I use
> > map-only jobs, the file gets written to the usual part-0000xx files.
> > Also, is there no support for this output format in v0.20.2? I mean,
> > is it necessary to use the deprecated classes if I want to use this?
> > Thanks,
> > Hari
>
> The MultipleOutputFormat class is not available for the new, unstable
> API; its functionality has been replaced by the MultipleOutputs class,
> which does much the same thing. However, the new-API MultipleOutputs is
> not part of Apache's Hadoop 0.20.2 release either [1].
>
> Using the stable API is still recommended (it is no longer marked
> deprecated in 0.20.3, and 0.21 also supports the old API).
>
> That said, it should still work for map-only jobs, as described in two
> of its use cases [2]. Could you give us some details of your code setup
> for using it?
>
> [1] - It is available as part of 0.21.0, though, or in Cloudera's
> Distribution including Apache Hadoop 0.20.2.
> [2] -
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html
>
> --
> Harsh J
>
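
As a rough illustration of the MultipleOutputs route mentioned above (the
new mapreduce API, so Hadoop 0.21.0 or a distribution that includes it), a
map-only mapper could write to per-record base paths along the lines of the
sketch below. PassThroughMapper and the tab-split base-path logic are
placeholder names, not anything from this thread:

  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

  public class PassThroughMapper
          extends Mapper<LongWritable, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
      mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
      // Derive a base output path from the record itself (placeholder logic),
      // e.g. "<first field>/part" ends up as "<first field>/part-m-00000".
      String base = value.toString().split("\t")[0] + "/part";
      mos.write(NullWritable.get(), value, base);
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
      mos.close();
    }
  }

The baseOutputPath overload of write() uses the job's configured
FileOutputFormat, so no reducer and no named outputs need to be declared
for this to produce one output file per base path.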
