hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "HowToDebugMapReducePrograms" by Amareshwari
Date Mon, 01 Oct 2007 09:14:44 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by Amareshwari:
http://wiki.apache.org/lucene-hadoop/HowToDebugMapReducePrograms

------------------------------------------------------------------------------
  
  This can be extremely useful for displaying debug information about the current record being
handled, or for setting debug flags about the status of the mapper. While running locally
on a small data set can uncover many bugs, large data sets may contain pathological cases that
are otherwise unexpected. This method of debugging can help catch those cases.
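
For example, a mapper might report the record it is currently handling through the Reporter and
write extra detail to stderr, which ends up in the task's stderr log. The sketch below is only an
illustration, written against the classic org.apache.hadoop.mapred API; the class name and the
empty-record check are made up.

{{{
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class DebuggingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    // Show which record is currently being handled on the task's status line.
    reporter.setStatus("processing record at offset " + key.get());
    // Extra detail about suspicious records goes to the task's stderr log.
    if (value.getLength() == 0) {
      System.err.println("DEBUG: empty record at offset " + key.get());
    }
    output.collect(value, new LongWritable(1));
  }
}
}}}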
  
+ = How to debug Hadoop Pipes programs =
+ 
+ In order to debug Pipes programs, you need to keep the downloaded commands. 
+ 
+ First, to keep the !TaskTracker from deleting the files when the task is finished, you need
to set either keep.failed.task.files (set it to true if the interesting task always fails)
or keep.task.files.pattern (set to a regex that includes the interesting task name).
+ 
+ Second, your job should set hadoop.pipes.command-file.keep to true in the !JobConf. This
will cause all of the tasks in the job to write their command stream to a file in the working
directory named downlink.data. This file will contain the JobConf, the task information, and
the task input, so it may be large. But it provides enough information that your executable
will run without any interaction with the framework. 
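+ 
+ As an illustration of these first two steps, the sketch below sets the relevant properties when
+ submitting a Pipes job from Java. It is only a sketch: the executable, input, and output paths
+ are made up, and it assumes the classic org.apache.hadoop.mapred API together with
+ org.apache.hadoop.mapred.pipes.Submitter.
+ 
+ {{{
+ import org.apache.hadoop.fs.Path;
+ import org.apache.hadoop.mapred.FileInputFormat;
+ import org.apache.hadoop.mapred.FileOutputFormat;
+ import org.apache.hadoop.mapred.JobConf;
+ import org.apache.hadoop.mapred.pipes.Submitter;
+ 
+ public class DebuggablePipesJob {
+   public static void main(String[] args) throws Exception {
+     JobConf conf = new JobConf();
+ 
+     // First: keep the task files on the TaskTracker after the task finishes.
+     // Use whichever of the two properties fits; the pattern is an example only.
+     conf.setBoolean("keep.failed.task.files", true);
+     // conf.set("keep.task.files.pattern", ".*_m_000123_.*");
+ 
+     // Second: make every task write its command stream to downlink.data
+     // in its working directory.
+     conf.setBoolean("hadoop.pipes.command-file.keep", true);
+ 
+     // Example executable and paths; adjust for your cluster.
+     Submitter.setExecutable(conf, "hdfs://namenode:9000/bin/my-pipes-program");
+     FileInputFormat.setInputPaths(conf, new Path("/user/me/input"));
+     FileOutputFormat.setOutputPath(conf, new Path("/user/me/output"));
+ 
+     Submitter.runJob(conf);
+   }
+ }
+ }}}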
+ 
+ Third, go to the host where the problem task ran, change into the work directory, and
+ {{{
+ setenv hadoop.pipes.command.file downlink.data
+ }}}
+ and run your executable under the debugger or valgrind. It will run as if the framework
was feeding it commands and data, and it will produce an output file, downlink.data.out, with the binary
commands that it would have sent up to the framework. Eventually, I'll probably make the downlink.data.out
file into a text-based format, but for now it is binary. Most problems, however, will be pretty
clear in the debugger or valgrind, even without looking at the generated data.
+ 
+ = The following sections are applicable only for Hadoop 0.15.0 and above =
+ 
- == Run a debug script when Task fails ==
+ = Run a debug script when Task fails =
  
  A facility is provided, via user-provided scripts, for post-processing task logs, the task's
stdout, stderr, syslog, and core files. There is a default script which processes core
dumps under gdb and prints the stack trace. The last five lines of the debug script's stdout and
stderr are printed on the diagnostics. These outputs are displayed on the job UI on demand. 
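
For example, a debug script can be attached through the JobConf setters, as in the sketch
below. This is only a sketch: the script name is made up, and the script itself must still be
shipped to the tasks, for instance via the "mapred.cache.files" mechanism described below.

{{{
import org.apache.hadoop.mapred.JobConf;

public class DebugScriptSetup {
  // Attach a user-provided debug script to both map and reduce tasks.
  // "./debug-script" is an example name; the script has to be available
  // in the task's working directory at run time.
  public static void configure(JobConf conf) {
    conf.setMapDebugScript("./debug-script");
    conf.setReduceDebugScript("./debug-script");
  }
}
}}}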
  
@@ -85, +101 @@

  The executable property can also be set through the APIs DistributedCache.addCacheExecutable(URI,conf)
and DistributedCache.setCacheExecutables(URI[],conf), where the URI is of the form "hdfs://host:port/<path>#<executable-name>".
  For Streaming, the executable can be added through -cacheExecutable URI.
  
- For gdb, the gdb command file need not be executable. But, the command file needs to be
in dfs. It can be added to cache by setting the property "mapred.cache.files" with the value
<path>#<cmd-file> or through the API DistribuedCache.addCacheFile(URI,conf).
+ For gdb, the gdb command file need not be executable, but it does need to be in DFS. It can
be added to the cache by setting the property "mapred.cache.files" to the value
<path>#<cmd-file>, or through the API DistributedCache.addCacheFile(URI,conf).
  Please make sure the property "mapred.create.symlink" is set to "yes".
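
For instance, the command file can be placed in the cache and symlinked from Java as in the
sketch below; the HDFS path and the local name are made up, and it assumes the classic
org.apache.hadoop.mapred API.

{{{
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class GdbCommandFileSetup {
  // Ship the gdb command file to the tasks and symlink it into each
  // task's working directory as "gdb-commands" (names are examples only).
  public static void configure(JobConf conf) throws Exception {
    DistributedCache.addCacheFile(
        new URI("hdfs://namenode:9000/user/me/gdb-commands#gdb-commands"), conf);
    DistributedCache.createSymlink(conf);  // sets "mapred.create.symlink" to "yes"
  }
}
}}}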
  
- = How to debug Hadoop Pipes programs =
- 
- In order to debug Pipes programs you need to keep the downloaded commands. 
- 
- First, to keep the !TaskTracker from deleting the files when the task is finished, you need
to set either keep.failed.task.files (set it to true if the interesting task always fails)
or keep.task.files.pattern (set to a regex that includes the interesting task name).
- 
- Second, your job should set hadoop.pipes.command-file.keep to true in the !JobConf. This
will cause all of the tasks in the job to write their command stream to a file in the working
directory named downlink.data. This file will contain the JobConf, the task information, and
the task input, so it may be large. But it provides enough information that your executable
will run without any interaction with the framework. 
- 
- Third, go to the host where the problem task ran, go into the work directory and
- {{{
- setenv hadoop.pipes.command.file downlink.data
- }}}
- and run your executable under the debugger or valgrind. It will run as if the framework
was feeding it commands and data and produce a output file downlink.data.out with the binary
commands that it would have sent up to the framework. Eventually, I'll probably make the downlink.data.out
file into a text-based format, but for now it is binary. Most problems however, will be pretty
clear in the debugger or valgrind, even without looking at the generated data.
- 
