From William Oberman
Subject cassandra + pig
Date Thu, 11 Oct 2012 14:43:30 GMT
I'm wondering how many people are using cassandra + pig out there?  I
recently went through the effort of validating things at a much higher
level than I previously did(*), and found a few issues:

In general, it seems like the widerow implementation still has rough edges.
I'm concerned that I don't understand why other people aren't using the
feature, and thus finding these problems.  Is everyone else just setting a
high static limit?  E.g.  LOAD 'cassandra://KEYSPACE/CF?limit=X' where X >=
the max number of columns in any key?  Is everyone else using data models
that result in keys with # columns always less than 1024?  Do newer versions
of hadoop consume the cassandra API in a way that works around these issues?
I'm using CDH3 == hadoop 0.20.2, pig 0.8.1.
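
For anyone relying on a static limit, one way to check whether it is safe is
to count columns per key and flag any key that exceeds it.  The following is
only a minimal sketch: the function name, the (key, column) input shape, and
the sample data are assumptions made for illustration, and the 1024 default
simply mirrors the figure mentioned above.

```python
from collections import Counter


def keys_over_limit(pairs, limit=1024):
    """Return {key: column_count} for keys with more than `limit` columns.

    `pairs` is any iterable of (key, column) tuples (a hypothetical input
    shape, e.g. produced by a plain client-side scan of the column family).
    """
    counts = Counter(key for key, _column in pairs)
    return {key: n for key, n in counts.items() if n > limit}


# Hypothetical sample: one narrow key and one wide key.
sample = [("narrow", i) for i in range(3)] + [("wide", i) for i in range(2000)]
print(keys_over_limit(sample))
```

An empty result suggests the chosen limit covers every key in the sample; any
key that shows up is a candidate for truncated input.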

(*) I took a random subsample of 50,000 keys of my production data (approx
1M total key/value pairs, some keys having only a single value and some
having 1000s).  I then wrote both a pig script and a simple procedural
version of the pig script.  Then I compared the results.  Obviously I
started with differences, though after locally patching my code to fix the
above 3 bugs (though, really only two issues), I now (finally) get the same
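
The comparison step described in (*) can be sketched roughly as follows.
This is not the code actually used; it assumes both the pig output and the
procedural output have already been loaded into dicts keyed by row key, and
the function and field names are hypothetical.

```python
def diff_results(pig_rows, reference_rows):
    """Compare two {key: value} result sets and report every mismatch."""
    pig_keys, ref_keys = set(pig_rows), set(reference_rows)
    return {
        # Keys the procedural version produced but pig did not.
        "missing_from_pig": sorted(ref_keys - pig_keys),
        # Keys pig produced that the procedural version did not.
        "extra_in_pig": sorted(pig_keys - ref_keys),
        # Keys present in both, but with different values.
        "different": sorted(
            k for k in pig_keys & ref_keys if pig_rows[k] != reference_rows[k]
        ),
    }


# Hypothetical per-key counts from the two implementations.
pig = {"a": 1, "b": 1024, "c": 7}
reference = {"a": 1, "b": 1500, "d": 2}
print(diff_results(pig, reference))
```

Three empty lists mean the two implementations agree on the sample.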

