hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "HadoopStreaming" by WimDepoorter
Date Wed, 29 Sep 2010 15:21:10 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "HadoopStreaming" page has been changed by WimDepoorter.
The comment on this change is: changed the location of streaming jar from "build/hadoop-streaming.jar"
to "$HADOOP_HOME/mapred/contrib/streaming/hadoop-0.xx.y-streaming.jar".
http://wiki.apache.org/hadoop/HadoopStreaming?action=diff&rev1=11&rev2=12

--------------------------------------------------

  Hadoop Streaming is a utility which allows users to create and run jobs with any executables
(e.g. shell utilities) as the mapper and/or the reducer.
  
  {{{
- 
- Usage: $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar [options]
+ Usage: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar
[options]
  Options:
    -input    <path>                   DFS input file(s) for the Map step
    -output   <path>                   DFS output directory for the Reduce step
@@ -55, +54 @@

     -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
  
  Shortcut to run from any directory:
-    setenv HSTREAMING "$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/build/hadoop-streaming.jar"
+    setenv HSTREAMING "$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar"
  
  Example: $HSTREAMING -mapper "/usr/local/bin/perl5 filter.pl"
             -file /local/filter.pl -input "/logs/0604*/*" [...]
    Ships a script, invokes the non-shipped perl interpreter
    Shipped files go to the working directory so filter.pl is found by perl
    Input files are all the daily logs for days in month 2006-04
- 
  }}}
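
  To make the usage above concrete, here is one minimal sketch of a complete invocation.  The
  input and output paths are placeholders, /bin/cat and /usr/bin/wc are only stand-ins for real
  mapper and reducer programs, and the exact jar file name depends on your release (e.g.
  hadoop-0.xx.y-streaming.jar):

  {{{
  $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar \
      -input  /dfsInputDir/someInputData \
      -output /dfsOutputDir/myResults \
      -mapper /bin/cat \
      -reducer /usr/bin/wc
  }}}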
- 
- 
  == Practical Help ==
  Using the streaming system you can develop working Hadoop jobs with ''extremely'' limited
knowledge of Java.  At its simplest, your development task is to write two shell scripts that
work well together; let's call them '''shellMapper.sh''' and '''shellReducer.sh'''.  On a
machine that doesn't even have Hadoop installed, you can get first drafts of these working
by testing them in a plain shell pipeline:
  
  {{{
  cat someInputFile | shellMapper.sh | shellReducer.sh > someOutputFile
  }}}
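
  As a concrete (purely hypothetical) illustration, a word-count pair of scripts that works in
  the pipeline above could look like the following; the script names and the awk one-liners are
  only an example, not part of the streaming utility itself:

  {{{
  #!/bin/sh
  # shellMapper.sh -- emit "<word><TAB>1" for every word read from stdin
  awk '{ for (i = 1; i <= NF; i++) print $i "\t1" }'
  }}}
  {{{
  #!/bin/sh
  # shellReducer.sh -- sum the counts per word read from stdin
  awk -F '\t' '{ count[$1] += $2 } END { for (w in count) print w "\t" count[w] }'
  }}}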
- 
  With streaming, Hadoop basically becomes a system for making shell-scripting pipelines
work (with some fudging) on a cluster.  There's a strong logical correspondence between the
Unix shell-scripting environment and Hadoop streaming jobs.  The above example run through
Hadoop has somewhat less elegant syntax, but this is what it looks like:
  
  {{{
- stream -input /dfsInputDir/someInputData -file shellMapper.sh -mapper "shellMapper.sh" -file
shellReducer.sh  -reducer "shellReducer.sh" -output /dfsOutputDir/myResults  
+ stream -input /dfsInputDir/someInputData -file shellMapper.sh -mapper "shellMapper.sh" -file
shellReducer.sh  -reducer "shellReducer.sh" -output /dfsOutputDir/myResults
  }}}
- 
- The real place the logical correspondence breaks down is that in a one-machine scripting
environment shellMapper.sh and shellReducer.sh will each run as a single process, and data
will flow directly from one process to the other.  With Hadoop, the shellMapper.sh file will
be sent to every machine on the cluster that holds data chunks, and each such machine will run
its own chunk through its local copy of shellMapper.sh.  The output from those
scripts ''doesn't'' get reduced on each of those machines.  Instead the output is sorted
so that lines from the various mapping jobs are streamed across the network to different
machines (Hadoop defaults to four machines) where the reduce(s) can be performed.
+ The real place the logical correspondence breaks down is that in a one-machine scripting
environment shellMapper.sh and shellReducer.sh will each run as a single process, and data
will flow directly from one process to the other.  With Hadoop, the shellMapper.sh file will
be sent to every machine on the cluster that holds data chunks, and each such machine will run
its own chunk through its local copy of shellMapper.sh.  The output from those
scripts ''doesn't'' get reduced on each of those machines.  Instead the output is sorted
so that lines from the various mapping jobs are streamed across the network to different
machines (Hadoop defaults to four machines) where the reduce(s) can be performed.
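
  One practical consequence of that sort step: if shellReducer.sh assumes that all lines sharing
  a key arrive together (as they do once Hadoop has sorted the map output), you can approximate
  the same behaviour in the single-machine test by inserting a sort between the two scripts.  A
  rough local stand-in for the whole job is:

  {{{
  cat someInputFile | shellMapper.sh | sort | shellReducer.sh > someOutputFile
  }}}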
  
  Here are practical tips for getting things working well:
  
