hadoop-common-user mailing list archives

From Jim Falgout <jim.falg...@pervasive.com>
Subject RE: Streaming mappers frequently time out
Date Wed, 23 Mar 2011 18:46:15 GMT
I've run into that before. Try setting mapreduce.task.timeout (mapred.task.timeout in older releases). I seem to remember that setting it to zero may turn off the timeout, but of course that can be dangerous if you have a runaway task. Note that the value is in milliseconds; the default is 600,000 ms, i.e. 600 seconds ;-)

Check out http://hadoop.apache.org/mapreduce/docs/current/mapred-default.html. It lists a
bunch of the MapReduce properties and their defaults.
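For reference, a sketch of what that looks like in the job configuration (the property name here assumes an older release; in newer Hadoop versions it is mapreduce.task.timeout, and the value shown is an arbitrary example):

```xml
<!-- Sketch: raise the task timeout to 20 minutes.
     The value is in milliseconds; 0 disables the timeout entirely,
     which risks runaway tasks never being killed. -->
<property>
  <name>mapred.task.timeout</name>
  <value>1200000</value>
</property>
```

The same setting can be passed per-job on the streaming command line with -D mapred.task.timeout=1200000.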

-----Original Message-----
From: Keith Wiley [mailto:kwiley@keithwiley.com] 
Sent: Wednesday, March 23, 2011 12:33 PM
To: common-user@hadoop.apache.org
Subject: Streaming mappers frequently time out

My streaming mappers frequently die with this error:

Task attempt_201103101623_12864_m_000032_1 failed to report status for 602 seconds. Killing!

A repeated attempt of the same task generally succeeds, but it's very wasteful that the task
is held up for ten minutes.  My mapper (and reducer) are C++ and use pthreads.  I start a
reporter thread as soon as the task starts, and that reporter thread sends periodic reporter
and status messages to cout using the streaming reporter syntax, but I still get these errors
occasionally.
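The heartbeat pattern described above can be sketched as follows (a Python analogue of the pthreads reporter, with hypothetical names; note that streaming parses reporter: lines from standard error, while standard output carries the mapper's key/value stream):

```python
import sys
import threading

def heartbeat(stop, interval=60):
    # Periodically tell the framework this task is alive. Streaming
    # recognizes "reporter:status:<msg>" lines written to stderr;
    # stdout is reserved for the mapper's key/value output.
    while not stop.wait(interval):
        sys.stderr.write("reporter:status:alive\n")
        sys.stderr.flush()

def main():
    stop = threading.Event()
    t = threading.Thread(target=heartbeat, args=(stop,))
    t.daemon = True
    t.start()  # start the heartbeat before doing any real work
    try:
        for line in sys.stdin:
            sys.stdout.write(line)  # identity mapper, for illustration only
    finally:
        stop.set()

if __name__ == "__main__":
    main()
```

The key point is that the heartbeat starts before the main work loop, so the framework sees status updates even if the first record takes a long time to process.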

Also, the task logs for such failed mappers are always either empty or unretrievable.  They
don't show ten minutes of actual work on the worker thread while the reporter should have
been reporting; rather, they are empty (or, as I said, totally unretrievable).  It seems
to me that Hadoop is failing to even start these tasks.  If the C++ binary had actually been
kicked off, the logs would show SOME kind of output (on cerr) even if the reporter thread
had not been started properly, because I send output to cerr before starting the reporter
thread, in fact before any pthread-related work at all (I write to cerr right at the entry
to main(), yet the logs are empty).  So I really think Hadoop isn't even starting the binary,
but then waits ten minutes to kill the task anyway.

Has anyone else seen anything like this?


Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be self-contented
is to be vile and ignorant, and that to aspire is better than to be blindly and impotently
happy."
                                           --  Edwin A. Abbott, Flatland
