hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "HadoopStreaming" by JenniferRM
Date Mon, 28 Jan 2008 22:02:23 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by JenniferRM:
http://wiki.apache.org/hadoop/HadoopStreaming

The comment on the change is:
Added some practical advice about getting streaming code to work.

------------------------------------------------------------------------------
  
  }}}
  
+ 
+ == Practical Help ==
+ Using the streaming system you can develop working Hadoop jobs with ''extremely'' limited
knowledge of Java.  At its simplest, your development task is to write two shell scripts that
work well together; let's call them '''shellMapper.sh''' and '''shellReducer.sh'''.  On a
machine that doesn't even have Hadoop installed you can get first drafts of these working
by writing them so that this pipeline works:
+ 
+ {{{
+ cat someInputFile | shellMapper.sh | shellReducer.sh > someOutputFile
+ }}}
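+ 
+ For instance (this word-count pair is only an illustration, not part of the original job),
the two scripts could be as small as:
+ 
+ {{{
+ # shellMapper.sh -- split each input line into one word per line
+ tr -s ' ' '\n'
+ 
+ # shellReducer.sh -- count how many times each word appears
+ sort | uniq -c
+ }}}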
+ 
+ With streaming, Hadoop essentially becomes a system for making shell-script pipelines work
(with some fudging) on a cluster.  There's a strong logical correspondence between the
Unix shell scripting environment and Hadoop streaming jobs.  The Hadoop version of the above
example has somewhat less elegant syntax, but this is what it looks like:
+ 
+ {{{
+ stream -input /dfsInputDir/someInputData -file shellMapper.sh -mapper "shellMapper.sh" -file
shellReducer.sh  -reducer "shellReducer.sh" -output /dfsOutputDir/myResults  
+ }}}
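+ 
+ The "stream" command above is shorthand; a more common way to launch the same job is through
the streaming jar that ships with Hadoop.  The jar's exact path varies between versions, so
treat the path below as an assumption to adapt to your install:
+ 
+ {{{
+ hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
+     -input /dfsInputDir/someInputData \
+     -output /dfsOutputDir/myResults \
+     -file shellMapper.sh  -mapper shellMapper.sh \
+     -file shellReducer.sh -reducer shellReducer.sh
+ }}}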
+ 
+ The real place the logical correspondence breaks down is that in a single-machine scripting
environment shellMapper.sh and shellReducer.sh each run as a single process, and data
flows directly from one process to the other.  With Hadoop, the shellMapper.sh file is
sent to every machine on the cluster that holds data chunks, and each such machine runs
its own chunks through its own shellMapper.sh process.  The output from those
scripts is ''not'' reduced on each of those machines.  Instead the output is sorted
so that different lines from the various mapping jobs are streamed across the network to different
machines (Hadoop defaults to four machines) where the reduce(s) can be performed.
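+ 
+ You can sketch that behaviour on a single machine by putting a plain sort between the two
scripts, with the sort standing in for Hadoop's shuffle (same hypothetical file names as above):
+ 
+ {{{
+ cat someInputFile | shellMapper.sh | sort | shellReducer.sh > someOutputFile
+ }}}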
+ 
+ Here are practical tips for getting things working well:
+ 
+ * '''Use shell scripts rather than commands''' - The "-file shellMapper.sh" part isn't entirely
necessary.  You can simply use a clause like "-mapper 'sed | grep | awk'" or some such, but
complicated quoting can introduce bugs.  Wrapping the job in a shell script (see the sketch
after these tips) eliminates some of these issues.
+ 
+ * '''Don't expect shebangs to work''' - If you're going to run other scripts from inside
your shell script, don't expect a line like #!/bin/python to work.  To be certain that things
will work, call the interpreter explicitly, like "grep somethingInteresting | '''perl''' perlScript
| sort | uniq -c"
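+ 
+ Putting both tips together (the script and file names here are just placeholders), a mapper
wrapper might contain nothing more than:
+ 
+ {{{
+ # shellMapper.sh -- wrapping the pipeline in a script avoids tricky quoting on the
+ # command line, and perl is called explicitly instead of relying on a #! line
+ grep somethingInteresting | perl perlScript | sort | uniq -c
+ }}}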
+ 
+ For more, see HowToDebugMapReducePrograms.
+ 
