cassandra-user mailing list archives

From Jeremy Hanna
Subject Re: prep for cassandra storage from pig
Date Wed, 15 Jun 2011 19:04:02 GMT
Hi Will,

That's partly why I like to use FromCassandraBag and ToCassandraBag from pygmalion: they do
the work of getting the data back into a form that Cassandra understands.

Others may know better how to massage the data into that form using just Pig, but if all else
fails, you could write a UDF to do that.
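
For what it's worth, one Pig-only way to end up with (key, {tuple}) might be to compute the
counts first and then regroup by target, so that the bag is built by GROUP instead of TOBAG.
This is an untested sketch that reuses the "grouping" alias from the script quoted below; the
aliases counts, by_target and cnt are made-up names, and whether CassandraStorage accepts the
result as-is is an assumption:

```pig
-- One row per (target, day) pair with its count; "cnt" is a made-up alias.
counts = FOREACH grouping GENERATE group.$0 AS target, group.$1 AS day, COUNT($1) AS cnt;
-- Regrouping by target collects each target's (target, day, cnt) rows into a bag.
by_target = GROUP counts BY target;
-- Keep only (day, cnt) from each tuple in the bag: (target, {(day, cnt), ...}).
X = FOREACH by_target GENERATE group AS target, counts.(day, cnt) AS day_count;
```

That should give one row per target, e.g. (targetA, {(day1, count), (day2, count)}), rather
than one row per target/day pair.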


On Jun 15, 2011, at 1:17 PM, William Oberman wrote:

> I think I'm stuck on typing issues trying to store data in cassandra.  To verify, cassandra wants (key, {tuples})
> My pig script is fairly brief:
> raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)});
> --columns == timeUUID -> JSON
> rows = FOREACH raw GENERATE key, FLATTEN(columns);
> alias_target_day = FOREACH rows {
>     --I wrote a specialized parser that does exactly what I need
>     observation_map = com.civicscience.pig.ParseObservation($2);
>     GENERATE $0 as alias, observation_map#'_fqt' as target, observation_map#'_day' as
> };
> grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
> X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1, COUNT($1)) as day_count;
> This gets me:
> (targetA, (day1, count))
> (targetA, (day2, count))
> (targetB, (day1, count))
> ....
> But, cassandra wants the 2nd item to be a bag.  So, I tried:
> X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1, COUNT($1))) as day_count;
> But this results in:
> (targetA, {((day1, count))})
> (targetA, {((day2, count))})
> (targetB, {((day1, count))})
> It's hard to see, but the 2nd item now has a nested tuple as the first value, which is still bad.
> How do I get (key, {tuple})???  I wasn't sure where to post this (pig or cassandra), so I'm posting to the pig list too.
> will
