I just pulled the code and read through the design.  Great stuff.

Any thought to potentially using this for real-time processing as well?  Right now, we have a set of Hadoop M/R jobs that operate against Cassandra for ETL.  We were looking at using Storm for the real-time processing side of things and thought that we could actually abandon Hadoop entirely if we could introduce Cassandra's concept of data locality to Storm.  We plan to run head-to-head comparisons between Storm and Hadoop to test out the viability of that approach.

Peregrine looks like another contender.


On Dec 27, 2011, at 6:14 AM, Kevin Burton wrote:

A key innovation here is a partitioning layout algorithm that can support fast
many to many recovery similar to HDFS but still support partitioned operation
with deterministic key placement.

Thanks for your contribution.

Is there more detailed info on this point?

yes... our design document:

I actually will probably write a paper on this...

The more I started down the partitioned filesystem approach in terms of mapreduce the more I realized that there were some REALLY elegant implementation and design issues that I did not originally appreciate ... (so I partially got lucky).

I think this approach could be generalized to work on normal map reduce jobs without much overhead.


Location: San Francisco, CA
Skype: burtonator
Skype-in: (415) 871-0687

Brian ONeill
Lead Architect, Health Market Science (