hadoop-common-user mailing list archives

From Chris K Wensel <ch...@wensel.net>
Subject cascading + riffle + ?
Date Tue, 03 Aug 2010 18:19:55 GMT

Sorry, cross posting to save time.

I now have a WIP of Cascading 1.2 that includes support for Riffle annotations.

Riffle is an Apache-licensed library that includes Java annotations for marking lifecycle
and dependency methods on a 'process' object.

That is, you can create custom objects with 'start' and 'stop' methods, as well as with getters
for incoming/outgoing resources (input files and output files).
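
To make the shape of such an object concrete, here is a minimal sketch. The annotations (Start, Stop, Incoming, Outgoing), the CopyJob class, and the call helper are all hypothetical stand-ins defined inside the example; they are not Riffle's actual annotation names or API, which should be taken from the Riffle source itself.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class ProcessSketch {
    // Hypothetical marker annotations, assumed for illustration only (not Riffle's).
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
    public @interface Start {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
    public @interface Stop {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
    public @interface Incoming {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
    public @interface Outgoing {}

    // A custom 'process' object: lifecycle methods plus getters for its resources.
    public static class CopyJob {
        private final String from;
        private final String to;
        public final List<String> log = new ArrayList<String>();

        public CopyJob(String from, String to) {
            this.from = from;
            this.to = to;
        }

        @Incoming public String getSource() { return from; }
        @Outgoing public String getSink() { return to; }
        @Start public void start() { log.add("start " + from + " -> " + to); }
        @Stop public void stop() { log.add("stop"); }
    }

    // Invokes the first public no-arg method carrying the given marker annotation.
    public static Object call(Object process, Class<? extends Annotation> marker) {
        for (Method m : process.getClass().getMethods()) {
            if (m.isAnnotationPresent(marker)) {
                try {
                    return m.invoke(process);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        throw new IllegalStateException("no method marked @" + marker.getSimpleName());
    }

    public static void main(String[] args) {
        CopyJob job = new CopyJob("in/data", "out/data");
        System.out.println("reads  " + call(job, Incoming.class));
        System.out.println("writes " + call(job, Outgoing.class));
        call(job, Start.class);
        call(job, Stop.class);
        System.out.println(job.log);
    }
}
```

The point of the annotation approach is that a runner only needs reflection to drive the object; the object itself does not have to implement any particular interface.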

With a collection of such objects, each one handling a particular task such as running a copy job
or a Mahout process, you can have either Riffle or Cascading chain and execute all the processes
in dependency order.
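
Running in dependency order can be sketched as a plain topological sort over resource names: a process that writes a resource must run before any process that reads it. This is a generic illustration, not how Riffle or Cascading actually schedule work; the Proc class, the order method, and the resource names are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DependencyOrderSketch {
    // Hypothetical process description: a name plus the resources it reads and writes.
    public static final class Proc {
        final String name;
        final List<String> incoming;
        final List<String> outgoing;

        public Proc(String name, List<String> incoming, List<String> outgoing) {
            this.name = name;
            this.incoming = incoming;
            this.outgoing = outgoing;
        }
    }

    // Returns process names ordered so every producer precedes its consumers.
    public static List<String> order(List<Proc> procs) {
        // Map each resource to the process that writes it.
        Map<String, Proc> producer = new HashMap<String, Proc>();
        for (Proc p : procs) {
            for (String resource : p.outgoing) {
                producer.put(resource, p);
            }
        }
        List<String> ordered = new ArrayList<String>();
        Set<Proc> done = new HashSet<Proc>();
        Set<Proc> visiting = new HashSet<Proc>();
        for (Proc p : procs) {
            visit(p, producer, done, visiting, ordered);
        }
        return ordered;
    }

    // Depth-first visit: emit all upstream producers before the process itself.
    private static void visit(Proc p, Map<String, Proc> producer,
                              Set<Proc> done, Set<Proc> visiting, List<String> ordered) {
        if (done.contains(p)) {
            return;
        }
        if (!visiting.add(p)) {
            throw new IllegalStateException("dependency cycle at " + p.name);
        }
        for (String resource : p.incoming) {
            Proc upstream = producer.get(resource);
            // A resource nobody writes is treated as an external input.
            if (upstream != null) {
                visit(upstream, producer, done, visiting, ordered);
            }
        }
        visiting.remove(p);
        done.add(p);
        ordered.add(p.name);
    }

    public static void main(String[] args) {
        List<Proc> procs = Arrays.asList(
            new Proc("report", Collections.singletonList("model"), Collections.singletonList("report")),
            new Proc("train", Collections.singletonList("staged"), Collections.singletonList("model")),
            new Proc("copy", Collections.singletonList("raw"), Collections.singletonList("staged")));
        System.out.println(order(procs)); // prints [copy, train, report]
    }
}
```

Because the ordering is derived only from the incoming and outgoing resources, the processes can be handed over in any order and unrelated kinds of work (a copy, a training job, a report) can share one chain.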

You can see more about Riffle here (which includes a tool to run a collection of processes):

You can download WIP builds for Cascading 1.2 (1.1 is the current stable version) here:

Note that Riffle is very early stage (and likely naive), and the Cascading support is likely
to evolve before the 1.2 final release (sometime this fall).

The long-term goal here is to allow Mahout and other projects to apply the annotations, so
that third-party tools can then be used to run the processes.

For you Cascading users, writing a simple DistCp wrapper (or putting the annotations directly
on the Hadoop DistCp object) would allow an efficient copy to run inside of a Cascade process
alongside your Flow instances.

Or more importantly, you can write iterative processes (e.g. PageRank) that act like
a single process even though internally an unknown number of Flows are being created on
the fly. (I'm running a connected component algorithm that requires multiple Flows/passes
in production now as a Riffle object.)
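
The idea of an iterative process that looks like a single process from the outside can be sketched as follows. This is an in-memory toy, not the production job mentioned above: each call to runPass stands in for one Flow, and the class name, methods, and label-propagation approach are assumptions chosen for illustration.

```java
import java.util.Arrays;

public class IterativeProcessSketch {
    private final int[][] edges;
    private final int[] labels;
    private int passes;

    public IterativeProcessSketch(int nodeCount, int[][] edges) {
        this.edges = edges;
        this.labels = new int[nodeCount];
        // Every node starts in its own component, labelled by its own id.
        for (int i = 0; i < nodeCount; i++) {
            labels[i] = i;
        }
    }

    // One pass (standing in for one Flow): push the smaller label across each edge.
    private boolean runPass() {
        boolean changed = false;
        for (int[] e : edges) {
            int min = Math.min(labels[e[0]], labels[e[1]]);
            if (labels[e[0]] != min) { labels[e[0]] = min; changed = true; }
            if (labels[e[1]] != min) { labels[e[1]] = min; changed = true; }
        }
        passes++;
        return changed;
    }

    // The single entry point a caller sees; the number of passes is not known up front.
    public void start() {
        while (runPass()) {
            // keep going until a pass changes nothing
        }
    }

    public int[] labels() { return labels.clone(); }
    public int passes() { return passes; }

    public static void main(String[] args) {
        int[][] edges = { {3, 4}, {1, 2}, {0, 1} };
        IterativeProcessSketch process = new IterativeProcessSketch(5, edges);
        process.start();
        System.out.println(Arrays.toString(process.labels())); // prints [0, 0, 0, 3, 3]
        System.out.println(process.passes());                  // prints 3
    }
}
```

A caller only invokes start and reads the result; whether that took three passes or thirty stays an internal detail, which is what lets such an object sit in a chain next to ordinary single-step processes.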

Please feel free to fork and tweak.


Chris K Wensel
