hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Generating many small PNGs to Amazon S3 with MapReduce
Date Thu, 16 Apr 2009 17:28:07 GMT
On Thu, Apr 16, 2009 at 1:27 AM, tim robertson <timrobertson100@gmail.com> wrote:

> What is not 100% clear to me is when to push to S3:
> In the Map I will output the TileId-ZoomLevel-SpeciesId as the key,
> along with the count, and in the Reduce I group the counts into larger
> tiles and create the PNG.  I could write to a SequenceFile here... but
> I suspect I could just push to the S3 bucket here as well - as long as
> the task tracker does not send the same keys to multiple reduce tasks
> - my Hadoop naivety showing here (I wrote an in-memory threaded
> MapReduceLite which does not compete reducers, but I haven't got into
> the Hadoop code quite so much yet).
Hi Tim,

If I understand what you mean by "compete reducers", then you're referring
to the feature called "speculative execution", in which Hadoop schedules
multiple TaskTrackers to perform the same task. When one of the
multiply-scheduled tasks finishes, the other one is killed. As you seem to
already understand, this might cause issues if your tasks have
non-idempotent side effects on the outside world.
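One common way around this is to make the side effect idempotent: if each reduce key maps to a deterministic S3 object name, a duplicate speculative attempt simply rewrites the same bytes, and the end state is identical however many attempts run. A minimal sketch of that idea (a plain dict stands in for the S3 bucket so it runs without AWS credentials; the key layout mirrors Tim's TileId-ZoomLevel-SpeciesId scheme but is otherwise hypothetical):

```python
# Stand-in for an S3 bucket: object name -> object bytes.
bucket = {}

def upload_tile(tile_key, png_bytes):
    """Write a tile under a name derived only from its reduce key.

    Because the object name is deterministic, a second speculative
    attempt of the same reduce task overwrites the object with the
    same bytes rather than creating a duplicate.
    """
    bucket[tile_key + ".png"] = png_bytes

# Two speculative attempts of the same reduce task:
upload_tile("12-3-987", b"\x89PNG...")
upload_tile("12-3-987", b"\x89PNG...")  # duplicate attempt, harmless
```

With this discipline the outcome is the same whether speculative execution is on or off; turning it off (below) is only needed when the side effect cannot be made idempotent.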

The configuration variable you need to look at is
mapred.reduce.tasks.speculative.execution. If this is set to false, each
reduce task runs exactly once, so each key reaches only a single reduce
attempt. If it is true, some reduce tasks may be scheduled twice, which
trades duplicate work for lower variance in job completion time when a
machine is running slowly.

There's an equivalent configuration variable
mapred.map.tasks.speculative.execution that controls this behavior for your
map tasks.
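For example, both properties from the paragraphs above could be set to false in a cluster's mapred-site.xml (or per-job through the job configuration) so that side-effecting tasks run at most one attempt at a time:

```xml
<!-- Disable speculative execution so task side effects
     (e.g. pushing PNGs to S3) are not duplicated. -->
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
```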

Hope that helps,
