hadoop-general mailing list archives

From Simone Leo <simone....@crs4.it>
Subject pydoop -- Python MapReduce and HDFS API for Hadoop
Date Fri, 06 Nov 2009 17:20:36 GMT
Hello everybody,

We recently released pydoop, a Python MapReduce and HDFS API for Hadoop.


It is implemented as a Boost.Python wrapper around the C++ code (pipes
and libhdfs). It allows you to write complete MapReduce applications in
CPython, with the same capabilities as the C++ API. Here is a minimal
wordcount example:

from pydoop.pipes import Mapper, Reducer, Factory, runTask

class WordCountMapper(Mapper):

  def __init__(self, context):
    super(WordCountMapper, self).__init__(context)

  def map(self, context):
    # Emit a (word, "1") pair for each whitespace-separated word in the input value.
    words = context.getInputValue().split()
    for w in words:
      context.emit(w, "1")

class WordCountReducer(Reducer):

  def __init__(self, context):
    super(WordCountReducer, self).__init__(context)

  def reduce(self, context):
    # Sum the counts for the current key; values arrive as strings.
    s = 0
    while context.nextValue():
      s += int(context.getInputValue())
    context.emit(context.getInputKey(), str(s))

# Entry point: pass the mapper and reducer classes to the pipes runtime.
runTask(Factory(WordCountMapper, WordCountReducer))
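
To make the data flow of the example easier to follow, here is a rough standalone sketch that simulates the same map and reduce steps in plain Python, without Hadoop or pydoop. The names map_line, reduce_key and local_wordcount are hypothetical helpers made up for this illustration; they are not part of the pydoop API, and the in-memory grouping only approximates what the framework does between the two phases.

```python
from collections import defaultdict


def map_line(line):
    """Mimic WordCountMapper.map: yield a (word, "1") pair per word."""
    for word in line.split():
        yield word, "1"


def reduce_key(key, values):
    """Mimic WordCountReducer.reduce: sum the string counts for one key."""
    return key, str(sum(int(v) for v in values))


def local_wordcount(lines):
    """Hypothetical local driver: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    # Process keys in sorted order, one reduce call per key.
    return dict(reduce_key(k, vs) for k, vs in sorted(grouped.items()))


if __name__ == "__main__":
    print(local_wordcount(["a b a", "b c"]))
```

As in the pydoop example, keys and values are kept as strings, so the reducer converts each count with int() before summing and emits the total as a string.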

Any feedback would be greatly appreciated.

Simone Leo
Distributed Computing group
Advanced Computing and Communications program
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simleo@crs4.it
