hadoop-mapreduce-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Best practices for jobs with large Map output
Date Fri, 15 Apr 2011 12:15:30 GMT
Thanks for the prompt response, Harsh!

The job is an indexing job. Each Mapper emits a small index and the Reducer
merges all of those indexes together. The Mappers output the index as a
Writable that serializes it. I guess I could write the Reducer's function
as a separate class, as you suggest, but then I'd need to write a custom
OutputFormat that puts those indexes on HDFS or somewhere?
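
For illustration, the Writable in question might look roughly like this. This is a minimal sketch using plain java.io; the real class would implement org.apache.hadoop.io.Writable and hold the actual index structure rather than an opaque byte array, and the class name is made up here:

```java
import java.io.*;

// Hypothetical sketch of a Writable-style value carrying a serialized index.
// In the real job this would implement org.apache.hadoop.io.Writable; plain
// DataOutput/DataInput are used here to show the same wire pattern.
class IndexWritable {
    private byte[] indexBytes = new byte[0];

    IndexWritable() {}
    IndexWritable(byte[] indexBytes) { this.indexBytes = indexBytes; }

    // Writable#write shape: length-prefixed payload, materialized as one record.
    void write(DataOutput out) throws IOException {
        out.writeInt(indexBytes.length);
        out.write(indexBytes);
    }

    // Writable#readFields shape: the whole record is allocated in memory at
    // once -- with 50-200 MB values, this is where the large allocations bite.
    void readFields(DataInput in) throws IOException {
        int len = in.readInt();
        indexBytes = new byte[len];
        in.readFully(indexBytes);
    }

    byte[] getBytes() { return indexBytes; }
}
```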

That complicates matters for me -- currently, when this job is run as part
of a sequence of jobs, I can guarantee that if the job succeeds, then the
indexes are successfully merged, and if it fails, the job should be
restarted. While that can be achieved with a separate FS-using program as
you suggest, it complicates matters.

Is my scenario that extreme? Would you say the common scenario for Hadoop
is jobs that output tiny objects between Mappers and Reducers?

Would this work much better if I used several Reducers? I'm not sure it
would, because the problem, IMO, lies in Hadoop allocating large contiguous
chunks of RAM for my records, instead of streaming them or breaking them
down into smaller chunks.

Is there absolutely no way to bypass the shuffle + sort phases? I don't mind
writing some classes if that's what it takes ...
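
(For reference, the shuffle and sort phases only run when there are reduce tasks; a map-only job writes map output straight through the OutputFormat. A sketch of a driver on the old 0.20 mapred API follows; this is a config fragment, and the job/mapper class names and output path are illustrative, not from the thread:)

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;

// Sketch: zero reduce tasks => map output goes directly to the
// OutputFormat, bypassing shuffle and sort entirely.
JobConf conf = new JobConf(IndexJob.class);      // IndexJob, IndexMapper are illustrative
conf.setMapperClass(IndexMapper.class);
conf.setNumReduceTasks(0);                       // no reducers => no shuffle, no sort
conf.setOutputFormat(SequenceFileOutputFormat.class);
FileOutputFormat.setOutputPath(conf, new Path("/indexes/partial"));
JobClient.runJob(conf);
```

The partial indexes then land as files under the output path, to be merged by a separate step.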


On Thu, Apr 14, 2011 at 9:50 PM, Harsh J <harsh@cloudera.com> wrote:

> Hello Shai,
> On Fri, Apr 15, 2011 at 12:01 AM, Shai Erera <serera@gmail.com> wrote:
> > Hi
> > I'm running on Hadoop 0.20.2 and I have a job with the following nature:
> > * Mapper outputs very large records (50 to 200 MB)
> > * Reducer (single) merges all those records together
> > * Map output key is a constant (could be a NullWritable, but currently
> > it's a LongWritable(1))
> > * Reducer doesn't care about the keys at all
> If I understand right, your single reducer's only work is to merge
> your multiple map's large record emits, and nothing else (It does not
> have 'keys' to worry about), correct?
> Why not do this with a normal FS-using program that opens a single
> file to write out map-materialized output files from a Map-only job to
> merge them?
> --
> Harsh J
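
The merge step Harsh describes could be a small standalone program. A local-filesystem sketch of the idea follows; the class and method names are made up, and a real version would open streams via Hadoop's FileSystem API against the map-only job's output directory instead of local files:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch: stream each partial output file into one merged file, never
// holding more than a small copy buffer in memory -- unlike the shuffle,
// which materializes whole records.
class PartialMerge {
    static void mergeFiles(List<Path> parts, Path merged) throws IOException {
        try (OutputStream out = Files.newOutputStream(merged)) {
            for (Path part : parts) {
                Files.copy(part, out); // streaming copy with a fixed-size buffer
            }
        }
    }
}
```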
