From Jeremy Hanna
Subject Re: hadoop map join with ColumnFamilyInputFormat
Date Thu, 01 Mar 2012 15:08:40 GMT
I haven't used CompositeInputFormat in particular, but the join itself is pretty trivial
to do with Pig, and I would imagine Pig would just do the right thing under the covers.
We use pygmalion to pull named columns out of the Cassandra column bag. A simple example would be:
DEFINE FromCassandraBag org.pygmalion.udf.FromCassandraBag();

raw_billing_account = LOAD 'cassandra://voltron/billing_account' USING org.apache.cassandra.hadoop.pig.CassandraStorage()
AS (id:chararray, columns:bag {column:tuple (name, value)});
billing_account = FOREACH raw_billing_account GENERATE
        id,
        FLATTEN(FromCassandraBag('name, age, address, city, state, zip',columns)) AS (
		name:		chararray,
		age:		chararray,
		address:	chararray,
		city:		chararray,
		state:		chararray,
		zip:		chararray
        );

raw_game_account =  LOAD 'cassandra://voltron/game_account' USING org.apache.cassandra.hadoop.pig.CassandraStorage()
AS (id:chararray, columns:bag {column:tuple (name, value)});
game_account = FOREACH raw_game_account GENERATE
        id,
        FLATTEN(FromCassandraBag('username, level, experience_points, super_powers, vehicles',columns))
AS (
		username:		chararray,
		level:			chararray,
		experience_points:	chararray,
		super_powers:		chararray,
		vehicles:		chararray
        );

composite_relation = FOREACH
	(join billing_account by id, game_account by id)
	GENERATE
		billing_account::id as id,

Anyway, I'm not sure if that's what you're looking for, but we do a lot of this with Pig:
joins on any attribute, group-bys, and things like that.
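
For illustration only, a complete join and projection might look something like the
sketch below. It assumes billing_account and game_account are defined as above and both
carry the row key as id. The relation names joined and account_summary, and the choice
of projected fields, are made up for this example.

```pig
-- Inner join of the two relations on the row key.
joined = JOIN billing_account BY id, game_account BY id;

-- After a join, fields are disambiguated with the relation::field syntax.
-- The fields picked here are illustrative; project whichever ones you need.
account_summary = FOREACH joined GENERATE
        billing_account::id   AS id,
        billing_account::name AS name,
        game_account::username AS username,
        game_account::level    AS level;

-- Print the result to the console (fine for small test data only).
DUMP account_summary;
```

Splitting the JOIN and the FOREACH into two statements is equivalent to nesting the join
inside the FOREACH; it just makes the intermediate relation easier to DESCRIBE while debugging.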

On Mar 1, 2012, at 4:45 AM, Benoit Mathieu wrote:

> Hi all,
> I want to write a MapReduce job with a Map task taking its data from 2
> CFs. Those 2 CFs have the same row keys and are in the same keyspace, so
> they are partitioned the same way across my cluster, and it would be
> nice if the Map task read both column families locally.
> In hadoop package org.apache.hadoop.mapred.join, there is a
> CompositeInputFormat class, which seems to do what I want, but it
> seems related to HDFS files as the "compose" method takes "Path" args.
> Has anyone ever written a CompositeColumnFamilyInputFormat, or
> does anyone have any insight about it?
> Cheers,
> Benoit

