hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4304) Add Dumbo to contrib
Date Mon, 29 Sep 2008 20:40:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635515#action_12635515 ]

Owen O'Malley commented on HADOOP-4304:
---------------------------------------

I've only looked very casually at this, but:
 * I'd suggest that run should take either functions or classes for map and reduce. A class
can do its setup in its constructor, which would remove the need for the mapconf and reduceconf
parameters that are currently doing initialization. Basically, I'd like to see:

{code}
# Mapper and Reducer stand for the proposed base classes.
class Tokenizer(Mapper):
  def __init__(self):
    # Load the exclusion list once, when the mapper is constructed.
    with open("excludes.txt", "r") as f:
      self.excludes = set(line.strip() for line in f)

  def map(self, key, value, context):
    for word in value.split():
      if word not in self.excludes:
        yield word, 1

class Summer(Reducer):
  def reduce(self, key, values, context):
    yield key, sum(values)
{code}

Of course, I'd suggest keeping support for your current map and reduce functions as well, so
that you could do either:

{code}
  dumbo.run(Tokenizer, Summer, combiner=Summer)
  # or
  dumbo.run(my_map, my_reduce, combiner=my_reduce)
{code}
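
One way the proposed dispatch could work is sketched below. This is only an illustration under stated assumptions, not Dumbo's actual implementation: `as_callable` and `run_locally` are hypothetical names, classes are assumed to take `(key, value, context)` while plain functions take `(key, value)`, and the "job" is simulated in memory rather than run through Hadoop Streaming.

```python
import inspect
from itertools import groupby
from operator import itemgetter


def as_callable(task, method_name, context=None):
    """Hypothetical helper: turn a class or a function into f(key, value)."""
    if inspect.isclass(task):
        # Instantiate once so that __init__ can do any one-time setup.
        bound = getattr(task(), method_name)
        return lambda key, value: bound(key, value, context)
    # Plain functions are assumed to already take (key, value).
    return task


def run_locally(mapper, reducer, records):
    """Simulate map, sort by key, and reduce over in-memory records."""
    map_fn = as_callable(mapper, "map")
    reduce_fn = as_callable(reducer, "reduce")
    mapped = []
    for key, value in records:
        mapped.extend(map_fn(key, value))
    mapped.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reduce_fn(key, (v for _, v in group)))
    return output


class Tokenizer:
    def __init__(self):
        self.excludes = {"the"}  # stand-in for reading excludes.txt

    def map(self, key, value, context):
        for word in value.split():
            if word not in self.excludes:
                yield word, 1


class Summer:
    def reduce(self, key, values, context):
        yield key, sum(values)


def my_map(key, value):
    for word in value.split():
        yield word, 1


def my_reduce(key, values):
    yield key, sum(values)


records = [(0, "the cat saw the dog"), (1, "the dog ran")]
print(run_locally(Tokenizer, Summer, records))
print(run_locally(my_map, my_reduce, records))
```

Because the class is instantiated exactly once per task, any setup in `__init__` plays the role that the separate configuration parameters play today, and existing function-based programs keep working unchanged.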

Personally, I'd prefer that you use SWIG and the Pipes C++ interface instead of streaming,
but I'm biased. *Smile* (It would also give you better performance and allow binary data
to be processed.)



> Add Dumbo to contrib
> --------------------
>
>                 Key: HADOOP-4304
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4304
>             Project: Hadoop Core
>          Issue Type: New Feature
>            Reporter: Klaas Bosteels
>            Priority: Minor
>
> Originally, Dumbo was a simple Python module developed at Last.fm to make writing and
> running Hadoop Streaming programs very easy, but it now also includes some (so far
> unreleased) helper code in Java (although it can still be used without the Java code). We
> propose to add Dumbo to "src/contrib" so that the Java classes get built and installed together
> with the rest of Hadoop, while the Python module can be installed separately at will. A tar.gz
> of the directory that would have to be added to "src/contrib" is available at
> http://static.last.fm/dumbo/dumbo-contrib.tar.gz
> and more info about Dumbo can be found here:
> * Basic documentation: http://github.com/klbostee/dumbo/wikis
> * Presentation at HUG (where it was first suggested to add Dumbo to contrib): http://skillsmatter.com/podcast/home/dumbo-hadoop-streaming-made-elegant-and-easy
> * Initial announcement: http://blog.last.fm/2008/05/29/python-hadoop-flying-circus-elephant
> For some of the more advanced features of Dumbo (in particular the ones for which the
> Java classes are needed) there is no public documentation yet, but we could easily fill that
> gap by moving some of the internal Last.fm documentation to the Hadoop wiki.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

