hadoop-mapreduce-user mailing list archives

From Allen Wittenauer <awittena...@linkedin.com>
Date Thu, 09 Sep 2010 18:56:53 GMT

On Sep 9, 2010, at 11:42 AM, Elton Pinto wrote:

> Does anyone know the difference between the Hadoop counter TOTAL_LAUNCHED_MAPS and the
> "mapred.map.tasks" parameter available in the JobConf?

mapred.map.tasks is what Hadoop thinks you need at a minimum.

TOTAL_LAUNCHED_MAPS will be all map task attempts, including speculative execution and task

> We're seeing some situations where these two don't match up, and so we're dropping data
> between jobs.

... which, given the above, is fairly normal.

> We know we're dropping data because the bytes written to HDFS in the first job don't
> match up with the number of bytes read into the second job, and the number of input files
> is equivalent to "mapred.map.tasks".

I'm fairly certain that the byte counters include all written bytes, including data that is
essentially thrown away due to the above.
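
Assuming that is right, a small Python sketch shows how the two totals can diverge. The attempt records and field names below are invented for illustration and are not taken from a real job or from any Hadoop API:

```python
# Illustrative only: made-up attempt records, not output from a real job.
attempts = [
    {"task": "m_000000", "bytes_written": 100, "committed": True},
    # A duplicate attempt of the same task whose output is discarded.
    {"task": "m_000001", "bytes_written": 120, "committed": False},
    {"task": "m_000001", "bytes_written": 120, "committed": True},
]

def bytes_by_outcome(attempts):
    """Return (bytes written by all attempts, bytes from committed attempts only)."""
    all_bytes = sum(a["bytes_written"] for a in attempts)
    kept_bytes = sum(a["bytes_written"] for a in attempts if a["committed"])
    return all_bytes, kept_bytes

all_bytes, kept_bytes = bytes_by_outcome(attempts)
print(all_bytes, kept_bytes)  # 340 220
```

If the counter behaves like `all_bytes` while the next job only reads what corresponds to `kept_bytes`, a mismatch by itself does not prove that data was lost.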

> And upon further analysis, it is dropping legitimate data (not duplicate data from speculative
> execution or anything like that; speculative execution is not likely to happen in these jobs,
> to be honest, because they're so fast).

It doesn't matter how fast they are.  Depending upon which version of Hadoop you run, it may
launch speculative attempts if there are spare task cycles.  For example, I'm looking at a job
on our grid right now that has 300 map tasks that average 40 seconds.  It got 96 spec exec
tasks to go with those 300, for a total of 396 map tasks.
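
To make the arithmetic concrete, here is a minimal Python sketch. The function name and its parameters are made up for illustration and are not part of any Hadoop API; the `other_extra_attempts` parameter is an assumption standing in for any additional attempts beyond speculative ones:

```python
def launched_map_attempts(map_tasks, speculative_attempts=0, other_extra_attempts=0):
    """Total number of map attempts launched for a job.

    Hypothetical helper for illustration only, not a Hadoop API.
    """
    return map_tasks + speculative_attempts + other_extra_attempts

# Numbers from the job described above: 300 map tasks plus 96 speculative attempts.
total = launched_map_attempts(300, speculative_attempts=96)
print(total)  # 396
```

So a launched-maps count that is higher than the number of map tasks is expected whenever extra attempts run.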

> Unfortunately, we run so many jobs that the JobTracker doesn't show us logs older than
> maybe 20 minutes ago so it's really hard to catch this problem in progress.

All of the log data should still be on the job tracker.  You just can't use the GUI to see
it. :)
