hadoop-general mailing list archives

From Simone Leo <simone....@crs4.it>
Subject Announcing Pydoop 0.3.6_rc2
Date Mon, 26 Jul 2010 14:50:42 GMT

We've just released version 0.3.6_rc2 of Pydoop
(http://pydoop.sourceforge.net). Pydoop is a Python MapReduce and HDFS
API for Hadoop, built upon the C++ Pipes and the C libhdfs APIs, that
lets you write full-fledged MapReduce applications with HDFS access.
Its key features are:

 * access to most MapReduce application components: Mapper, Reducer,
RecordReader, RecordWriter, Partitioner;
 * direct access to JobConf parameters; support for counters and status;
 * CPython implementation: any Python module can be used, either pure
Python or C/C++ extension (note that this is not possible with Jython);
 * direct HDFS access from Python.

With Pydoop you can write complete applications in Python, using a
programming style that's very similar to the one supported by the Java
and C++ APIs: developers define classes that are instantiated and used
by the framework. The resulting code is much cleaner and faster [1] than
with the traditional Python + Streaming approach.
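
To give a rough idea of the class-based style described above, here is a
minimal word count sketch. It does not import Pydoop: the Context, Mapper
and Reducer classes and the run_local driver below are hypothetical
stand-ins defined only for this illustration, with method names loosely
modeled on the C++ Pipes API. In a real job the framework, not a local
loop, would instantiate the classes and call them.

```python
from collections import defaultdict


class Context:
    """Hypothetical stand-in for the context a framework passes to tasks."""

    def __init__(self, key=None, values=()):
        self._key = key
        self._values = iter(values)
        self._current = None
        self.emitted = []

    def getInputKey(self):
        return self._key

    def getInputValue(self):
        return self._current

    def nextValue(self):
        # Advance to the next value; return False when none are left.
        try:
            self._current = next(self._values)
            return True
        except StopIteration:
            return False

    def emit(self, key, value):
        self.emitted.append((key, value))


class Mapper:
    """Stand-in base class (assumption, not Pydoop's own)."""

    def map(self, context):
        raise NotImplementedError


class Reducer:
    """Stand-in base class (assumption, not Pydoop's own)."""

    def reduce(self, context):
        raise NotImplementedError


class WordCountMapper(Mapper):
    def map(self, context):
        # Emit (word, "1") for every word in the input line.
        for word in context.getInputValue().split():
            context.emit(word, "1")


class WordCountReducer(Reducer):
    def reduce(self, context):
        # Sum all the values received for the current key.
        total = 0
        while context.nextValue():
            total += int(context.getInputValue())
        context.emit(context.getInputKey(), str(total))


def run_local(mapper_cls, reducer_cls, lines):
    """Toy single-process driver: instantiate, map, group by key, reduce."""
    mapper, reducer = mapper_cls(), reducer_cls()
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        ctx = Context(key=offset, values=[line])
        ctx.nextValue()  # load the single input value
        mapper.map(ctx)
        for key, value in ctx.emitted:
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        ctx = Context(key=key, values=groups[key])
        reducer.reduce(ctx)
        results.update(ctx.emitted)
    return results
```

Calling run_local(WordCountMapper, WordCountReducer, lines) with a list of
text lines returns a dict mapping each word to its count as a string. The
point of the sketch is only that the application logic lives in two small
classes, while everything else is left to whatever drives them.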

See http://pydoop.sourceforge.net/docs/examples for a collection of
Pydoop usage examples, among them a complete application that leverages
the Hadoop Distributed Cache to distribute all required Python packages,
including Pydoop itself, to Hadoop cluster nodes.

Pydoop is actively used in production at our site, mostly for
data-intensive biocomputing applications.

The 0.3.6_rc2 release is being used internally in production. We'd
greatly appreciate any kind of feedback before we release it as 0.3.6
(stable), which we expect to do within two weeks or so.


 * download page: http://sourceforge.net/projects/pydoop/files
 * release notes:

[1] Simone Leo and Gianluigi Zanetti. "Pydoop: a Python MapReduce and
HDFS API for Hadoop". In Proceedings of the 19th ACM International
Symposium on High Performance Distributed Computing (HPDC 2010), pages
819–825. ACM, 2010.

Simone Leo
Distributed Computing group
Advanced Computing and Communications program
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simleo@crs4.it
