hadoop-mapreduce-user mailing list archives

From John Sanda <john.sa...@gmail.com>
Subject using output from one job as input to another
Date Thu, 03 Mar 2011 02:21:55 GMT
Hi, I am new to Hadoop, so maybe I am missing something obvious. I have
written a small MapReduce program that runs two jobs. I want the output of
the first job to serve as the input to the second. Here is what my driver
code looks like:

public int run(String[] args) throws Exception {
    Configuration conf = getConf();

    Job job = new Job(conf, "Job One");
    job.setJarByClass(CountCitations.class);

    Path in = new Path(args[0]);
    Path out1 = new Path("jobOneOutput");

    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, out1);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Stop here if the first job fails; its output is the second job's input.
    if (!job.waitForCompletion(true)) {
        return 1;
    }

    job = new Job(conf, "Job Two");
    job.setJarByClass(MyJob.class);

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.setInputPaths(job, out1);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MapCounts.class);
    job.setReducerClass(ReduceCounts.class);

    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Return the exit code instead of calling System.exit() inside run();
    // ToolRunner turns this into the process exit status.
    return job.waitForCompletion(true) ? 0 : 1;
}

The output path created by the first job is a directory, and it is the file
in that directory with a name like part-r-00000 that I want to feed as
input to the second job. I am running in pseudo-distributed mode, so I know
that file name will be the same on every run. But in a truly distributed
mode that file name will be different for each node. Moreover, when running
in distributed mode, don't I want a uniform view of that output, which
will be spread across my cluster? Is there something wrong in my code?
Or can someone point me to some examples that do this?
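For what it's worth, passing the output directory itself (as the code above already does with `out1`) should work: when an input path is a directory, FileInputFormat reads every file inside it and skips hidden entries (names beginning with `_` or `.`), so all the per-reducer part-r-* files are picked up without naming them individually. A minimal sketch of that hidden-file filtering rule (the class and method names here are illustrative, not Hadoop's own):

```java
// Sketch of the filtering rule FileInputFormat applies to the contents
// of an input directory: regular files like part-r-00000 are accepted,
// while bookkeeping entries such as _SUCCESS, _logs, and .crc files
// are skipped. This mirrors the documented behavior; it is not the
// actual Hadoop implementation.
public class PartFileFilter {

    // Accept a file name unless it is "hidden" by Hadoop's convention.
    public static boolean accept(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    public static void main(String[] args) {
        System.out.println(accept("part-r-00000"));       // reducer output
        System.out.println(accept("_SUCCESS"));           // job marker
        System.out.println(accept(".part-r-00000.crc"));  // checksum file
    }
}
```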

Thanks

- John
