hadoop-common-user mailing list archives

From "Owen O'Malley" <o...@yahoo-inc.com>
Subject Re: Hadoop debugging
Date Mon, 17 Jul 2006 16:04:40 GMT

On Jul 17, 2006, at 1:36 AM, Thomas FRIOL wrote:

> Hi all,
> I am a new Hadoop user and I am now writing my own map reduce
> operations, but it is hard for me to find out where the problem
> comes from when a job fails.
> So my question is: what is the best way to debug a map reduce job?

Ok, I should probably put this onto a wiki page, but my short answer is:

1. Start by getting everything running (likely on a small input) in the
local runner. You do this by setting your job tracker to "local" in your
config. The local runner is not distributed and can run under the
debugger.

2. Run the small input on a 1 node cluster. This will smoke out all of 
the issues that happen with distribution and the "real" task runner, 
but you only have a single place to look at logs. Most useful are the 
task and job tracker logs. Make sure you are logging at the INFO level 
or you will miss clues like the output of your tasks.

3. Run on a big cluster. Recently, I added the keep.failed.task.files 
config variable that tells the system to keep files for tasks that 
fail. This leaves "dead" files around that you can debug with. On the 
node with the failed task, go to the task tracker's local directory and 
cd to <local>/taskTracker/<taskid> and run
% hadoop org.apache.hadoop.mapred.IsolationRunner job.xml
This will run the failed task in a single jvm, which can be in the 
debugger, over precisely the same input.
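
As a rough sketch of the two config settings mentioned in steps 1 and 3,
the plain Java below prints them in the <property> layout used by Hadoop's
XML config files. It has no Hadoop dependency. The class and method names
are hypothetical, and the property name mapred.job.tracker and the value
"true" are assumptions not stated above, so check them against your
version's default config.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DebugConfigSketch {
    // Renders name/value pairs in the <property> layout of a Hadoop XML config file.
    static String toConfigXml(Map<String, String> props) {
        StringBuilder sb = new StringBuilder("<configuration>\n");
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append("  <property>\n")
              .append("    <name>").append(e.getKey()).append("</name>\n")
              .append("    <value>").append(e.getValue()).append("</value>\n")
              .append("  </property>\n");
        }
        return sb.append("</configuration>\n").toString();
    }

    // Step 1: point the job tracker at "local" (property name is an assumption).
    static Map<String, String> localRunnerOverrides() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapred.job.tracker", "local");
        return props;
    }

    // Step 3: keep the files of failed tasks (the value "true" is an assumption).
    static Map<String, String> keepFailedTaskOverrides() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("keep.failed.task.files", "true");
        return props;
    }

    public static void main(String[] args) {
        System.out.print(toConfigXml(localRunnerOverrides()));
        System.out.print(toConfigXml(keepFailedTaskOverrides()));
    }
}
```

The two settings are kept in separate fragments because they belong to
different steps: the first is for local debugging, the second for a real
cluster.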

I also have a patch that will let you specify a task to keep, even if 
it doesn't fail. Other than that, logging is your friend.
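
To illustrate why the logging level matters, here is a minimal sketch of
level filtering. It uses java.util.logging only to stay dependency-free;
Hadoop's own logging setup may differ, and the class name, logger name,
and messages are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class LogLevelSketch {
    // Returns the messages that pass the logger's level check at the given threshold.
    static List<String> capture(Level threshold) {
        Logger log = Logger.getLogger("sketch.task." + threshold.getName());
        log.setUseParentHandlers(false);   // keep the demo off the console handler
        log.setLevel(threshold);
        final List<String> seen = new ArrayList<>();
        Handler collector = new Handler() {
            @Override public void publish(LogRecord r) {
                seen.add(r.getLevel() + ": " + r.getMessage());
            }
            @Override public void flush() {}
            @Override public void close() {}
        };
        log.addHandler(collector);
        log.info("map task read 42 records");        // dropped when threshold is WARNING
        log.warning("spill directory is nearly full");
        log.removeHandler(collector);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("INFO threshold:    " + capture(Level.INFO));
        System.out.println("WARNING threshold: " + capture(Level.WARNING));
    }
}
```

At the WARNING threshold the INFO message never reaches the handler,
which is the kind of silent filtering that hides task output.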

I don't have issues with my log messages getting through, so you might 
check your filters. Exceptions are mostly handled right, but we've 
found and fixed spots where they weren't, so that is possible. Usually
it involves someone throwing an unchecked exception such as
RuntimeException while the catch block only catches checked exceptions.
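
A minimal sketch of that failure mode, in plain Java with no Hadoop
classes: the Task interface and both runner methods are hypothetical
stand-ins for user code and framework code, not Hadoop's actual
implementation.

```java
import java.io.IOException;

class UncheckedEscapeSketch {
    // Hypothetical stand-in for user map/reduce code.
    interface Task {
        void run() throws IOException;
    }

    // Only anticipates the checked exception, so unchecked ones propagate past it.
    static String runCatchingCheckedOnly(Task t) {
        try {
            t.run();
            return "ok";
        } catch (IOException e) {
            return "reported: " + e.getMessage();
        }
    }

    // Catching Throwable reports unchecked failures as well.
    static String runCatchingAll(Task t) {
        try {
            t.run();
            return "ok";
        } catch (Throwable e) {
            return "reported: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        Task bad = () -> { throw new IllegalStateException("bad record"); };
        System.out.println(runCatchingAll(bad));   // prints "reported: bad record"
        try {
            runCatchingCheckedOnly(bad);
        } catch (RuntimeException e) {
            // The narrow catch never saw it; the error surfaces one level up.
            System.out.println("escaped: " + e.getMessage());   // prints "escaped: bad record"
        }
    }
}
```

If nothing further up the stack catches and logs the escaped exception,
the task can die without a useful message.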

-- Owen
