incubator-cassandra-user mailing list archives

From Jeremy Hanna <jeremy.hanna1...@gmail.com>
Subject Re: useful little way to run locally with (pig|hive) && cassandra
Date Wed, 15 Jun 2011 19:00:52 GMT
Cool - thanks Dmitriy!

On Jun 15, 2011, at 12:54 PM, Dmitriy Ryaboy wrote:

> Another tip:
> If you parametrize your load statements, it becomes easy to switch
> between loading from something like Cassandra, and reading from HDFS
> or local fs directly.
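
A minimal sketch of the parametrized-load tip, using Pig's parameter substitution. The parameter names `input` and `loader`, the file name `sample_rows.tsv`, and the script name used below are hypothetical, not taken from the thread:

```pig
-- Hypothetical defaults: read a local tab-separated sample file.
%default input 'sample_rows.tsv'
%default loader 'PigStorage()'

-- Both the location and the load function come from parameters,
-- so the rest of the flow does not change when the source changes.
rows = LOAD '$input' USING $loader;

few = LIMIT rows 10;
DUMP few;
```

Running `pig -x local flow.pig` would use the defaults. Overriding them on the command line, for example `-param input=cassandra://MyKeyspace/MyCF -param loader='org.apache.cassandra.hadoop.pig.CassandraStorage()'`, should point the same script at Cassandra, assuming the Cassandra Pig jars and connection settings are already configured. The keyspace and column family names are placeholders, and the two sources will generally produce different schemas, so downstream statements may need to account for that.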
> 
> Also:
> Try using Pig's "illustrate" command when working through your flows
> -- it does some clever things that go far beyond simple random
> sampling of source data, in order to ensure that you can see the
> effects of doing filters, that joins get (possibly artificial)
> matching keys even if you sampled in a way that didn't actually
> produce any, etc.
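
As a rough illustration of the "illustrate" tip, here is a sketch with a made-up two-field schema and a hypothetical input file:

```pig
-- Hypothetical input: tab-separated (key, score) pairs.
rows = LOAD 'sample_rows.tsv' USING PigStorage() AS (key:chararray, score:int);

-- A filter that a small random sample might never satisfy.
high = FILTER rows BY score > 100;

-- ILLUSTRATE shows example rows at each step of the flow leading to 'high'.
ILLUSTRATE high;
```

The point of the sketch is only the last statement: ILLUSTRATE takes an alias and walks through the steps that produce it with a small set of example data.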
> 
> D
> 
> On Wed, Jun 15, 2011 at 10:35 AM, Jeremy Hanna
> <jeremy.hanna1234@gmail.com> wrote:
>> We started doing this recently and thought it might be useful to others.
>> 
>> Pig (and Hive) have a sample function that allows you to sample data from your data store.
>> 
>> In Pig it looks something like this:
>> mysample = SAMPLE myrelation 0.01;
>> 
>> One possible use for this with Pig and Cassandra is to solve the conundrum of testing locally. We had wondered how to do this, so we decided to sample a column family (or set of CFs), store the sample in HDFS (or CFS), download it locally, and then import it into a local Cassandra node. That gives you real data to test against with Pig/Hive or for other purposes.
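
The sampling step described above might look roughly like the following. This is a sketch under assumptions: the keyspace, column family, and output path are placeholders, and it presumes the Cassandra Pig support (`CassandraStorage`) is registered and configured on the cluster:

```pig
-- Placeholder keyspace/column family; load rows from the cluster's Cassandra.
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage();

-- Keep roughly 1% of the rows.
mysample = SAMPLE rows 0.01;

-- Write the sample to HDFS; BinStorage is used here so the nested
-- structure can be read back by Pig later (an assumption, not from the thread).
STORE mysample INTO '/tmp/mycf_sample' USING BinStorage();
```

From there the output directory can be copied down with something like `hadoop fs -get /tmp/mycf_sample`, and a second local-mode script could load it and store it into the local node. That import step is not shown because its details depend on the local setup.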
>> 
>> That way, when you're flying out to the Hadoop Summit or the Cassandra SF event, you can play with real data :).
>> 
>> Maybe others have been doing this for years, but if not, we're finding it handy.
>> 
>> Jeremy

