hadoop-mapreduce-user mailing list archives

From Elton Pinto <epti...@gmail.com>
Date Fri, 10 Sep 2010 06:36:33 GMT
Just a follow-up: you were right. It was the speculative execution tasks. It
turns out that because we weren't using the OutputCollector, we had a race
condition under speculative execution. We are going to refactor to use the
OutputCollector, but in the meantime we just turned off speculative execution.
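For anyone hitting the same problem, the stop-gap is a plain configuration change. The property names below are the 0.20-era ones (later Hadoop versions renamed them under `mapreduce.*`), so treat this as a sketch for that version:

```xml
<!-- mapred-site.xml (or set per-job): disable speculative attempts,
     so only one attempt of each task runs and side-effect writes can't race -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```

Note this only masks the bug: output emitted through the OutputCollector is committed by the framework for exactly one successful attempt, which is why refactoring to use it is the real fix.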

On Thu, Sep 9, 2010 at 11:56 AM, Allen Wittenauer wrote:

> On Sep 9, 2010, at 11:42 AM, Elton Pinto wrote:
> > Does anyone know the difference between the Hadoop counter
> TOTAL_LAUNCHED_MAPS and the "mapred.map.tasks" parameter available in the
> JobConf?
> mapred.map.tasks is what Hadoop thinks you need at a minimum.
> TOTAL_LAUNCHED_MAPS will be all map task attempts, including speculative
> execution and task recovery.
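One quick way to see the gap Allen describes is to pull the counter straight off a finished job with the CLI. The counter group below is the 0.20-era internal enum name, which is an assumption that may differ across versions, and `<job-id>` is a placeholder:

```shell
# Total map attempts actually launched, including speculative
# and recovery attempts (0.20-era counter group name):
hadoop job -counter <job-id> \
  'org.apache.hadoop.mapred.JobInProgress$Counter' TOTAL_LAUNCHED_MAPS
```

If that number exceeds the job's configured map count, speculative or recovered attempts ran, which by itself is normal.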
> > We're seeing some situations where these two don't match up, and so we're
> dropping data between jobs.
> ... which given the above is fairly normal.
> > We know we're dropping data because the bytes written to HDFS in the
> first job doesn't match up with the number of bytes read into the second
> job, and the number of input files is equivalent to "mapred.map.tasks".
> I'm fairly certain that the byte counters include all written bytes,
> including data that is essentially thrown away due to the above.
> > And it is dropping legitimate data upon further analysis (not duplicate
> data from speculative execution or anything like that - speculative
> execution not likely to happen in these jobs to be honest though because
> they're so fast).
> It doesn't matter how fast.  Depending upon which version of Hadoop, it may
> launch speculatives if there are task cycles.  For example, I'm looking at a
> job on our grid right now that has 300 map tasks that average 40 seconds.
>  It got 96 spec exec tasks to go with those 300, for a total of 396 map
> tasks.
> > Unfortunately, we run so many jobs that the JobTracker doesn't show us
> logs older than maybe 20 minutes ago so it's really hard to catch this
> problem in progress.
> All of the log data should still be on the job tracker.  You just can't use
> the GUI to see it. :)
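On that last point: with 0.20-era defaults, a copy of the completed-job history also lands under the job's output directory, and the CLI can render it long after the web UI has rotated the job out. Paths here are the defaults and may differ per install:

```shell
# History files are kept under the job output directory by default:
hadoop fs -ls <job-output-dir>/_logs/history

# Summarize a completed job (counters, task attempts, failures)
# from its history files, no JobTracker GUI needed:
hadoop job -history <job-output-dir>
```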
