Hi all,

I am trying to solve an issue with Pig's integration with Cassandra using CqlLoader. I don't know whether the problem lies in CqlLoader, in my limited understanding of Pig (I hope this is actually the case), or in some bug in the combination of Pig and CqlLoader. Sorry if this turns out to be a Pig question rather than a Cassandra one.

I have a table that uses a CQL map:

CREATE TABLE test (
  name text PRIMARY KEY,
  sources map<text, text>
)

I need to denormalise the map in order to perform some sanity checks on the rest of the DB (an outer join of the map values against other tables in the Cassandra keyspace). I want to create triples containing the table key, the map key and the map value for further joining. The map holds anywhere from nothing (null) to a few tens of entries. The table test itself is pretty small.
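
To make the target shape concrete, here is a minimal Python sketch of the result I am after. It is only a model of the data, not Pig or Cassandra code: the function name denormalise and the sample rows are made up for illustration, and each row is assumed to be a (name, map-or-None) pair.

```python
def denormalise(rows):
    """Yield one (name, map_key, map_value) triple per map entry."""
    for name, sources in rows:
        # A null or empty map contributes no triples at all.
        # Entries are sorted by key only to make the output order predictable.
        for key, value in sorted((sources or {}).items()):
            yield (name, key, value)


# Hypothetical sample rows mirroring the table above.
rows = [
    ("name1", {"k1": "s1", "k2": "s2"}),
    ("name2", None),
]
triples = list(denormalise(rows))
```

With these sample rows, triples holds ("name1", "k1", "s1") and ("name1", "k2", "s2"), and name2 produces nothing.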

This is what I do:

grunt> data = LOAD 'cql://keyspace/test' USING CqlStorage();
grunt> describe data;
data: {name: chararray,sources: ()}
grunt> data1 = filter data by sources is not null;
grunt> dump data1;
(name1,((k1,s1),(k2,s2)))
grunt> data2 = foreach data1 generate name, flatten(sources);
grunt> dump data2;
(name1,(k1,s1),(k2,s2))
grunt> describe data2;
Schema for data2 unknown.
grunt> data3 = FOREACH data2 generate $0 as name, FLATTEN(TOBAG($1..$100)); -- I know there will be at most tens of entries in the map
grunt> dump data3;
(name1,k1,s1)
(name1,k2,s2)
(name1,)
(name1,)
... 95 more lines here ...
grunt> data4 = FILTER data3 BY $1 IS NOT null;
grunt> dump data4;
(name1,k1,s1)
(name1,k2,s2)
grunt> describe data4;
data4: {name: bytearray,bytearray}
grunt> data5 = foreach data4 generate $0, $1;
grunt> dump data5;
(name1,k1)
(name1,k2)
grunt> p = foreach data4 generate $0, $2;
Details at logfile: /..../pig_xxx.log
From the log file:
Pig Stack Trace
---------------
ERROR 1000: 
<line 28, column 33> Out of bound access. Trying to access non-existent column: 2. Schema name:bytearray,:bytearray has 2 column(s).

org.apache.pig.impl.plan.PlanValidationException: ERROR 1000: 
<line 28, column 33> Out of bound access. Trying to access non-existent column: 2. Schema name:bytearray,:bytearray has 2 column(s).
at org.apache.pig.newplan.logical.expression.ProjectExpression.findColNum(ProjectExpression.java:197)
at org.apache.pig.newplan.logical.expression.ProjectExpression.setColumnNumberFromAlias(ProjectExpression.java:174)

Given the schema, the error is no surprise. What is strange is that I can see the map values in the dump (see dump data4), yet I have no way to reach them from Pig Latin.

I tried to simulate the situation using the PigStorage loader. This is the closest I got (not exactly the same, but roughly):

grunt> data = load 'test.csv' using PigStorage(',');
grunt> dump data;
(key1,mk1,mv1,mk2,mv2)
(key2)
(key3,mk1,mv3,mk2,mv4)
grunt> data1 = foreach data generate $0, TOTUPLE($1, $2), TOTUPLE($3, $4);
grunt> dump data1;
(key1,(mk1,mv1),(mk2,mv2))
(key2,(,),(,))
(key3,(mk1,mv3),(mk2,mv4))
grunt> data2 = FOREACH data1 generate $0 as name, FLATTEN(TOBAG($1..$2));
grunt> dump data2;
(key1,mk1,mv1)
(key1,mk2,mv2)
(key2,,)
(key2,,)
(key3,mk1,mv3)
(key3,mk2,mv4)
grunt> describe data2;
data2: {name: bytearray,bytearray,bytearray}

This is exactly what I need. The only problem is that this simulation doesn't let me use an arbitrarily high upper bound in the FLATTEN(TOBAG()) call: I need to know the size of the row in advance.
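
One possible way around the fixed upper bound would be a small Jython UDF that turns the tuple of (key, value) tuples into a bag of any length, with an explicit output schema. The sketch below is untested against CqlStorage and rests on several assumptions: that Pig hands the sources field to the UDF as a Python tuple of 2-tuples, that the values are strings, and that the outputSchema decorator is provided by Pig when the script is registered. The file name map_udf.py, the alias map_udf and the function name tuple_to_bag are hypothetical.

```python
# Hypothetical usage from grunt (file name and alias are made up):
#   REGISTER 'map_udf.py' USING jython AS map_udf;
#   triples = FOREACH data1 GENERATE name, FLATTEN(map_udf.tuple_to_bag(sources));

try:
    outputSchema  # assumed to be injected by Pig for Jython UDF scripts
except NameError:
    # Fallback so the file can also be run and tested outside Pig.
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("pairs:bag{t:tuple(key:chararray,value:chararray)}")
def tuple_to_bag(sources):
    """Convert ((k1, v1), (k2, v2), ...) into a list of (key, value) tuples.

    Pig treats a returned list of tuples as a bag, so FLATTEN should yield
    one row per map entry, whatever the size of the map.
    """
    if sources is None:
        return []  # null map: empty bag, so FLATTEN emits no rows
    # Keep only well-formed (key, value) pairs.
    return [(pair[0], pair[1]) for pair in sources
            if pair is not None and len(pair) == 2]
```

Because the schema is declared on the UDF, the flattened key and value columns should be named and typed, which is what the $1..$100 approach loses.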

Questions:

- Is this the correct way to denormalise the data? This is a Pig question, but maybe someone here will know (I am a Pig newbie).
- Could there be a problem with the internal data representation returned from CqlStorage? See the difference between the data loaded from a file and the data loaded from Cassandra.

Versions: Cassandra 1.2.11, Pig 0.12.

Thanks in advance,

Ondrej Cernos