incubator-cassandra-dev mailing list archives

From Jonathan Ellis <jbel...@gmail.com>
Subject Re: hadoop tasks reading from cassandra
Date Thu, 30 Jul 2009 21:09:33 GMT
On Wed, Jul 29, 2009 at 1:37 AM, Jeff Hodges<jeff@somethingsimilar.com> wrote:
> Comments inline.
>
> On Fri, Jul 24, 2009 at 10:00 AM, Jonathan Ellis<jbellis@gmail.com> wrote:
>> On Fri, Jul 24, 2009 at 11:08 AM, Jun Rao<junrao@almaden.ibm.com> wrote:
>>> 1. In addition to OrderPreservingPartitioner, it would be useful to support
>>> MapReduce on RandomPartitioned Cassandra as well. We had a rough prototype
>>> that sort-of works at this moment. The difficulty with random partitioner
>>> is that it's a bit hard to generate the splits. In our prototype, we simply
>>> map each row to a split. This is ok for fat rows (e.g., a row includes all
>>> info for a user), but may be too fine-grained for other cases. Another
>>> possibility is to generate a split that corresponds to a set of rows in a
>>> hash-range (instead of key range). This requires some new apis in
>>> cassandra.
>>
>> -1 on adding new apis to pound a square peg into a round hole.
>>
>> like range queries, hadoop splits only really make sense on OPP.
>>
>
> Why would it only make sense on OPP? If it wasn't an externally
> exposed part of the api, what other concerns do you have about a hash
> range query? I can't think of any beyond the usual increased code
> complexity argument (i.e. development, testing and maintenance costs
> for it).

Because you have to violate encapsulation pretty badly and provide ops
acting on a hash instead of a key, so you'd be providing a parallel,
public api that only applies to the hash partitioner.

It's a bad enough hack that I'd say "feel free to maintain that in
your own tree, but not in the public repo." :)
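For concreteness, a minimal sketch of what Jun's hash-range splits might look like: divide the RandomPartitioner token space (assumed here to be 0..2^127) into equal-width ranges, one per Hadoop split. The class and method names are hypothetical, not Cassandra API.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of hash-range split generation for a random
// partitioner; nothing here is real Cassandra or Hadoop API.
public class HashRangeSplitter {
    // Assumed token space for an MD5-based partitioner: 0 .. 2^127.
    static final BigInteger MAX_TOKEN = BigInteger.ONE.shiftLeft(127);

    // Returns n contiguous [start, end) token pairs covering the ring;
    // the last range is widened to absorb integer-division remainder.
    static List<BigInteger[]> splits(int n) {
        List<BigInteger[]> out = new ArrayList<>();
        BigInteger width = MAX_TOKEN.divide(BigInteger.valueOf(n));
        for (int i = 0; i < n; i++) {
            BigInteger start = width.multiply(BigInteger.valueOf(i));
            BigInteger end = (i == n - 1) ? MAX_TOKEN : start.add(width);
            out.add(new BigInteger[] { start, end });
        }
        return out;
    }

    public static void main(String[] args) {
        for (BigInteger[] r : splits(4))
            System.out.println(r[0] + " .. " + r[1]);
    }
}
```

The catch, as noted above, is that serving such splits means exposing operations keyed on hashes rather than keys.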

> There is something in Hadoop called NetworkTopology that attempts to
> solve some of the data locality problem. It's used to provide data
> locality for CombineFileInputFormat (among, I'm sure, other things).
>
> Combining this with the knowledge we would have of which Node each key
> range would be from, there is a chance Hadoop could do some of the
> locality work for us. Looking at the code for CombineFileInputFormat,
> it doesn't seem to be a particularly straightforward bit of work to
> translate to Cassandra, but I'm sure with a little time and maybe a
> little guidance from some Hadoop folks, we could make it happen.
>
> In any case, this seems to be evidence that locality can be added on
> later. It will not be a simple drop-in deal, but it wouldn't seem to
> require us to completely overhaul how we think about the input
> splitting.
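A Hadoop-free sketch of the locality idea being described: each split remembers which nodes own its key range, mirroring what an InputSplit's getLocations() reports so the scheduler can place tasks near the data. All names below are illustrative.

```java
import java.util.Arrays;

// Illustrative split type: carries a key range plus the hosts that
// replicate it, analogous to what a real Hadoop InputSplit subclass
// would return from getLocations(). Not actual Cassandra code.
public class RangeSplit {
    final String startKey, endKey;
    final String[] replicaHosts;  // nodes holding this key range

    RangeSplit(String startKey, String endKey, String... replicaHosts) {
        this.startKey = startKey;
        this.endKey = endKey;
        this.replicaHosts = replicaHosts;
    }

    // Analogous to org.apache.hadoop.mapreduce.InputSplit#getLocations():
    // hints the scheduler toward hosts where the data lives.
    String[] getLocations() {
        return replicaHosts.clone();
    }

    public static void main(String[] args) {
        RangeSplit s = new RangeSplit("a", "m", "node1.example", "node2.example");
        System.out.println(Arrays.toString(s.getLocations()));
    }
}
```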

Jun mentioned #197 -- I'm still -1 on adding such a beast to the
thrift API, but I think it would be ok to expose it in
get_string_property, suitably (json?) encoded.
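One way the JSON-encoded payload might look, sketched below: a flat object mapping token ranges to their endpoints, built by hand since the shape is simple. The property shape and naming are assumptions, not a spec.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of rendering a ring description (range -> endpoints) as the
// kind of JSON string get_string_property could return. The exact
// format is an assumption for illustration only.
public class TokenMapJson {
    static String toJson(Map<String, List<String>> ring) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, List<String>> e : ring.entrySet()) {
            if (!first) sb.append(",");
            first = false;
            sb.append("\"").append(e.getKey()).append("\":[");
            for (int i = 0; i < e.getValue().size(); i++) {
                if (i > 0) sb.append(",");
                sb.append("\"").append(e.getValue().get(i)).append("\"");
            }
            sb.append("]");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> ring = new LinkedHashMap<>();
        ring.put("0..85", List.of("node1"));
        ring.put("85..170", List.of("node2"));
        System.out.println(toJson(ring));
    }
}
```

A client building splits would fetch this string once and decode it, keeping the hash-range machinery out of the Thrift API itself.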

> (Oh, and has anyone got a mnemonic or anything to remember which of
> org.apache.hadoop.mapred and org.apache.hadoop.mapreduce is the new
> one? I'll be jiggered if I can keep it straight.)

mapreduce is the new one.  they got lucky and left the full name open
for their second try. :)

-Jonathan
