hadoop-common-dev mailing list archives

From "Klaas Bosteels (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4304) Add Dumbo to contrib
Date Fri, 14 Nov 2008 01:02:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647488#action_12647488 ]

Klaas Bosteels commented on HADOOP-4304:
----------------------------------------

It actually is necessary to install the Dumbo Python module in the system directory
of the machine from which you run your jobs, because when you run e.g.

python wordcount.py -hadoop ~hadoop/hadoop-0.17.2.1 -input brian.txt -output brian-wc -file
excludes.txt

the "dumbo.run()" call gets executed on that machine, which makes Dumbo generate and execute
the Streaming command

/home/hadoop/hadoop-0.17.2.1/bin/hadoop jar /home/hadoop/hadoop-0.17.2.1/contrib/streaming/hadoop-0.17.2.1-streaming.jar
-input 'brian.txt' -output 'brian-wc' -file 'excludes.txt' -mapper 'python wordcount.py map
0' -reducer 'python wordcount.py red 0' -file 'wordcount.py' -file '/usr/lib/python2.4/site-packages/dumbo.py'
-jobconf 'mapred.job.name=wordcount.py'

under the hood. The part 

-file '/usr/lib/python2.4/site-packages/dumbo.py'

of this automatically generated command makes sure that the Dumbo module from the system dir
is put in the working dir on each cluster node (i.e. it is not necessary to install the Dumbo
Python module in the system dir on the cluster nodes), and the extra args supplied to "python
wordcount.py" in

-mapper 'python wordcount.py map 0'
-reducer 'python wordcount.py red 0' 

tell Dumbo that it has to run the actual map or reduce instead of generating and
executing a Streaming command on the nodes. Hence, we need the ability to install Dumbo in
the system dir in order to make it very easy to start jobs, which is one of the main features
of Dumbo (some more info about this can be found at http://github.com/klbostee/dumbo/wikis/running-programs).
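To make the above concrete, here is a minimal sketch of what a Dumbo program such as wordcount.py typically looks like (the function bodies and the ImportError guard are illustrative, not the exact code behind the commands above):

```python
# wordcount.py -- minimal sketch of a Dumbo program (illustrative)

def mapper(key, value):
    # value is a line of input text; the key is ignored here
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # values is an iterator over the counts emitted for this word
    yield key, sum(values)

if __name__ == "__main__":
    # On the submitting machine, dumbo.run() generates and executes the
    # Streaming command; on the cluster nodes, the extra "map 0" / "red 0"
    # args make it run the actual map or reduce phase instead.
    try:
        import dumbo
        dumbo.run(mapper, reducer)
    except ImportError:
        pass  # the dumbo module is only needed where jobs are started
```

Since the same script plays both roles, only the submitting machine needs the module in its system dir; "-file" ships it to the nodes.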


I also do not think that the "classes" and "test" dirs are superfluous (but I might very well
be wrong). As mentioned in the description of this ticket, Dumbo also consists of some helper
Java code, e.g. a special input format that makes it easy to use SequenceFiles as input
for Dumbo programs (there currently is no public documentation available for these features,
but we are planning to change that once this patch gets approved), and there are also some
unit tests for this helper code.


Concerning the NFS mounting, I'm not really sure what you mean. Maybe that "excludes.txt" has
to be in the cwd? This is handled by adding the option "-file excludes.txt" to the command
(as in the command above), so there is no need for any NFS mounts...
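For instance, because "-file excludes.txt" places the file in the working directory on every node, the program can open it by relative path. A hypothetical helper (not part of Dumbo itself) might look like:

```python
def load_excludes(path="excludes.txt"):
    # "-file excludes.txt" puts the file in the job's working directory
    # on each cluster node, so a relative path is enough -- no NFS mount
    # is involved. (Hypothetical helper, not part of Dumbo itself.)
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())
```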



> Add Dumbo to contrib
> --------------------
>
>                 Key: HADOOP-4304
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4304
>             Project: Hadoop Core
>          Issue Type: New Feature
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>            Priority: Minor
>         Attachments: hadoop-4304-v2.patch, hadoop-4304-v3.patch, hadoop-4304.patch
>
>
> Originally, Dumbo was a simple Python module developed at Last.fm to make writing and
running Hadoop Streaming programs very easy, but now it also consists of some (until now
unreleased) helper code in Java (although it can still be used without the Java code). We
propose to add Dumbo to "src/contrib" so that the Java classes get built/installed together
with the rest of Hadoop, and the Python module can be installed separately at will. A tar.gz
of the directory that would have to be added to "src/contrib" is available at
of the directory that would have to be added to "src/contrib" is available at
> http://static.last.fm/dumbo/dumbo-contrib.tar.gz
> and more info about Dumbo can be found here:
> * Basic documentation: http://github.com/klbostee/dumbo/wikis
> * Presentation at HUG (where it was first suggested to add Dumbo to contrib): http://skillsmatter.com/podcast/home/dumbo-hadoop-streaming-made-elegant-and-easy
> * Initial announcement: http://blog.last.fm/2008/05/29/python-hadoop-flying-circus-elephant
> For some of the more advanced features of Dumbo (in particular the ones for which the
Java classes are needed) there is no public documentation yet, but we could easily fill that
gap by moving some of the internal Last.fm documentation to the Hadoop wiki.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

