cassandra-user mailing list archives

From William Oberman <ober...@civicscience.com>
Subject Re: cassandra + pig
Date Thu, 11 Oct 2012 15:28:31 GMT
If you don't mind me asking, how are you handling the fact that pre-widerow
you are only getting a static number of columns per key (default 1024)?  Or
am I not understanding the "limit" concept?
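
To make the concern concrete, here is a minimal sketch of what I think a static limit does, assuming it simply caps the number of columns returned per key. Plain Python dicts stand in for the column family; `load_with_limit` and `truncated_keys` are hypothetical helpers, not part of any Cassandra or Pig API.

```python
# Illustrative only: dicts stand in for a column family; no Cassandra or Pig API is used.
DEFAULT_LIMIT = 1024  # the default per-key column limit mentioned above


def load_with_limit(rows, limit=DEFAULT_LIMIT):
    """Return at most `limit` columns per key, as a static-limit load is assumed to."""
    return {key: cols[:limit] for key, cols in rows.items()}


def truncated_keys(rows, limit=DEFAULT_LIMIT):
    """Return the keys whose rows hold more columns than the limit."""
    return sorted(key for key, cols in rows.items() if len(cols) > limit)


rows = {
    "narrow": [("c%d" % i, i) for i in range(3)],
    "wide": [("c%d" % i, i) for i in range(5000)],
}
loaded = load_with_limit(rows)
print(len(loaded["narrow"]), len(loaded["wide"]))  # 3 1024
print(truncated_keys(rows))  # ['wide']
```

If that assumption holds, any key with more columns than the limit is cut short without an error, which is why I'm asking.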

On Thu, Oct 11, 2012 at 11:25 AM, Jeremy Hanna
<jeremy.hanna1234@gmail.com> wrote:

> The Dachis Group (where I just came from, now at DataStax) uses pig with
> cassandra for a lot of things.  However, we weren't using the widerow
> implementation yet since wide row support is new to 1.1.x and we were on
> 0.7, then 0.8, then 1.0.x.
>
> I think since it's new to 1.1's hadoop support, it sounds like there are
> some rough edges like you say.  But tickets with reproducible cases for any
> problems are much appreciated and they will get addressed.
>
> On Oct 11, 2012, at 10:43 AM, William Oberman <oberman@civicscience.com>
> wrote:
>
> > I'm wondering how many people are using cassandra + pig out there?  I
> recently went through the effort of validating things at a much higher
> level than I previously did(*), and found a few issues:
> > https://issues.apache.org/jira/browse/CASSANDRA-4748
> > https://issues.apache.org/jira/browse/CASSANDRA-4749
> > https://issues.apache.org/jira/browse/CASSANDRA-4789
> >
> > In general, it seems like the widerow implementation still has rough
> edges.  I'm concerned I'm not understanding why other people aren't using
> the feature, and thus finding these problems.  Is everyone else just
> setting a high static limit?  E.g.  LOAD 'cassandra://KEYSPACE/CF?limit=X'
> where X >= the max size of any key?  Is everyone else using data models
> that result in keys with # columns always less than 1024?  Do newer versions
> of hadoop consume the cassandra API in a way that works around these issues?
>  I'm using CDH3 == hadoop 0.20.2, pig 0.8.1.
> >
> > (*) I took a random subsample of 50,000 keys of my production data
> (approx 1M total key/value pairs, some keys having only a single value and
> some having 1000's).  I then wrote both a pig script and a simple procedural
> version of the pig script.  Then I compared the results.  Obviously I
> started with differences, though after locally patching my code to fix the
> above 3 bugs (though, really only two issues), I now (finally) get the same
> results.
>
>
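
The validation described in the footnote of the original message (run the pig script and a procedural version over the same subsample, then compare) could be sketched roughly as below. This assumes both outputs have already been loaded into dicts mapping key to result; `diff_results` and the sample values are hypothetical and not taken from the thread.

```python
def diff_results(pig_out, reference):
    """Compare two key -> result mappings and report where they disagree."""
    pig_keys, ref_keys = set(pig_out), set(reference)
    missing_in_pig = sorted(ref_keys - pig_keys)  # keys only the reference produced
    extra_in_pig = sorted(pig_keys - ref_keys)    # keys only the pig script produced
    mismatched = sorted(k for k in pig_keys & ref_keys if pig_out[k] != reference[k])
    return missing_in_pig, extra_in_pig, mismatched


# Made-up per-key column counts, for illustration only.
pig_out = {"a": 3, "b": 1024, "d": 1}
reference = {"a": 3, "b": 5000, "c": 7}

missing, extra, mismatched = diff_results(pig_out, reference)
print(missing, extra, mismatched)  # ['c'] ['d'] ['b']
```

Three empty lists would mean the two versions agree on the subsample.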
