incubator-cassandra-dev mailing list archives

From Jonathan Ellis <jbel...@gmail.com>
Subject Re: hadoop tasks reading from cassandra
Date Fri, 24 Jul 2009 17:00:20 GMT
On Fri, Jul 24, 2009 at 11:08 AM, Jun Rao<junrao@almaden.ibm.com> wrote:
> 1. In addition to OrderPreservingPartitioner, it would be useful to support
> MapReduce on RandomPartitioned Cassandra as well. We had a rough prototype
> that sort-of works at this moment. The difficulty with random partitioner
> is that it's a bit hard to generate the splits. In our prototype, we simply
> map each row to a split. This is ok for fat rows (e.g., a row includes all
> info for a user), but may be too fine-grained for other cases. Another
> possibility is to generate a split that corresponds to a set of rows in a
> hash-range (instead of key range). This requires some new apis in
> cassandra.

-1 on adding new APIs to pound a square peg into a round hole.

Like range queries, Hadoop splits only really make sense on OPP
(OrderPreservingPartitioner).
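
For readers following the thread, the hash-range idea in point 1 can be
sketched roughly as below. This is only an illustration, not Cassandra or
Hadoop code: the function names (token, hash_range_splits, split_for_key)
are hypothetical, and the token is assumed to be the 128-bit MD5 digest of
the key read as an unsigned integer, which may not match how
RandomPartitioner actually computes tokens.

```python
import hashlib

# Assumption: tokens are 128-bit MD5 digests treated as unsigned integers.
TOKEN_SPACE = 2 ** 128


def token(key: str) -> int:
    """Hash a row key to a position in the assumed token space."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def hash_range_splits(num_splits: int):
    """Cut the token space into contiguous half-open ranges [lo, hi)."""
    step = TOKEN_SPACE // num_splits
    bounds = [i * step for i in range(num_splits)] + [TOKEN_SPACE]
    return [(bounds[i], bounds[i + 1]) for i in range(num_splits)]


def split_for_key(key: str, splits) -> int:
    """Return the index of the split whose hash range contains the key."""
    t = token(key)
    for i, (lo, hi) in enumerate(splits):
        if lo <= t < hi:
            return i
    raise ValueError("token outside token space")


splits = hash_range_splits(4)
assignment = {k: split_for_key(k, splits) for k in ["alice", "bob", "carol"]}
```

Each split then covers many rows rather than one, but the rows in a split
are related only by hash value, not by key order, so serving such a split
would still need the new server-side support the quoted message mentions.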

> 2. For better performance, in the future, it would be useful to expose and
> exploit data locality in cassandra so that a map task is executed on a
> cassandra node that owns the data locally. A related issue is
> https://issues.apache.org/jira/browse/CASSANDRA-197. It breaks
> encapsulation, but it's worth thinking about. Google's DFS and Bigtable
> both expose certain locality info for better performance.

That's why I'd like to ship Hadoop integration out of the box, instead
of exposing APIs that should really be internal-use only just to serve
an external Hadoop layer.

-Jonathan
