hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "PythonWordCount" by OwenOMalley
Date Mon, 07 Aug 2006 18:13:19 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by OwenOMalley:
http://wiki.apache.org/lucene-hadoop/PythonWordCount

------------------------------------------------------------------------------
  
  This is the WordCount example written entirely in [http://python.org/ Python] and
compiled using [http://www.jython.org/Project/index.html Jython] into a Java jar file.
  
- The program reads text files and counts how often words occur.  The input is text files
and the output is text files, each line of which contains a word and the count of how often
it occurred, separated by a tab.
+ The program reads text files and counts how often words occur.  The input is text files
and the output is text files, each line of which contains a word and the count of how often
it occurred, separated by a tab. To create some input, take a directory of text files
and put it into DFS.
+ {{{
+ bin/hadoop dfs -put my-dir in-dir
+ }}}
  
  Each mapper takes a line as input and breaks it into words. It then emits a key/value pair
of the word and 1. Each reducer sums the counts for each word and emits a single key/value
pair with the word and its sum.
  
  As an optimization, the reducer is also used as a combiner on the map outputs. This reduces
the amount of data sent across the network by combining each word into a single record.
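
The map, combine, and reduce steps described above can be sketched in plain Python. This is only an illustration that runs locally and does not use the Hadoop API or the code in src/examples/python; the function names (map_line, reduce_counts, run_map_task, word_count) are hypothetical, and splitting on whitespace is an assumption about how words are separated.

```python
from collections import defaultdict


def map_line(line):
    """Mapper: break one input line into words and emit a (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reducer: sum the counts for one word. The same function serves as the combiner."""
    return (word, sum(counts))


def group_by_key(pairs):
    """Group values by key, standing in for the framework's shuffle/sort step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_map_task(lines):
    """One map task: map every line, then combine locally so that each word
    leaves the task as a single record instead of one record per occurrence."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return [reduce_counts(word, counts) for word, counts in group_by_key(pairs).items()]


def word_count(splits):
    """Run one map task per input split, then reduce the combined outputs."""
    combined = [pair for split in splits for pair in run_map_task(split)]
    return dict(reduce_counts(word, counts) for word, counts in group_by_key(combined).items())


if __name__ == "__main__":
    # Two hypothetical input splits, each a list of lines.
    splits = [["the quick brown fox", "the lazy dog"], ["the end"]]
    for word, count in sorted(word_count(splits).items()):
        print(f"{word}\t{count}")
```

Reusing the reducer as the combiner is safe here because addition is associative and commutative, so summing partial sums gives the same total as summing the individual 1s.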
  
  To compile the example, build the Hadoop code:{{{
-   ant
+ ant
-   cd src/examples/python
+ cd src/examples/python
-   ./compile
+ ./compile
+ cd ../../..
  }}}
  
  To run the example, the command syntax is: {{{
+ bin/hadoop jar src/examples/python/wc.jar in-dir out-dir
-   ../../../bin/hadoop jar wc.jar [-m <#maps>] [-r <#reducers>] \
-     <in-dir> <out-dir>
  }}}
  
+ The results of the word count will be in out-dir/part-*.
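
As a rough sketch of how the tab-separated output could be read back, the following Python parses lines of the form word, tab, count. It assumes the part-* files have already been copied out of DFS to a local directory; the function names and the local path are hypothetical.

```python
import glob
import os


def parse_counts(lines):
    """Parse 'word<TAB>count' lines into a dict mapping word to integer count."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the count is always the final field.
        word, count = line.rsplit("\t", 1)
        counts[word] = int(count)
    return counts


def read_part_files(directory):
    """Read every part-* file in a local copy of the output directory."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(directory, "part-*"))):
        with open(path) as handle:
            counts.update(parse_counts(handle))
    return counts


if __name__ == "__main__":
    # "out-dir" here is a hypothetical local copy of the job output.
    for word, count in sorted(read_part_files("out-dir").items()):
        print(word, count)
```

Merging with update is enough under the assumption that each word is handled by exactly one reducer and therefore appears in only one part file.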
+ 
