incubator-cassandra-user mailing list archives

From Jeremy Hanna <>
Subject Re: pig counting question
Date Thu, 24 Mar 2011 23:00:31 GMT
And if you download the 0.7 branch and build the cassandra_storage.jar in the contrib/pig section
with that update, you should be able to use it with your 0.7.3 cluster.  Those changes are
typically independent of the Cassandra version.

On Mar 24, 2011, at 5:49 PM, Jeremy Hanna wrote:

> Hmmm, for wide rows, I believe you can page through them with some changes that recently
> made it into the 0.7 branch. Specifically, using the 0.7 branch version of CassandraStorage,
> you can specify the column slice using this basic template:
> cassandra://<keyspace>/<columnfamily>[?slice_start=<start>&slice_end=<end>[&reversed=true][&limit=1]]
> That goes in your pig LOAD block.
> So I would imagine it's a pain to do what you're doing, but it's possible to page with the
> latest on the 0.7 branch.
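
To make the template above concrete, here is a minimal Python sketch that fills it in and produces the string that would go inside the quotes of the Pig LOAD statement. The helper name `cassandra_location` and its parameters are hypothetical, not part of CassandraStorage, and the sketch assumes the slice bounds need no URL escaping. It follows the bracket nesting of the template, so `reversed` and `limit` are only emitted when both slice bounds are given.

```python
def cassandra_location(keyspace, column_family,
                       slice_start=None, slice_end=None,
                       reverse=False, limit=None):
    """Build a location string shaped like the template quoted in this thread.

    Hypothetical helper for illustration only; it does not talk to Cassandra.
    """
    location = "cassandra://%s/%s" % (keyspace, column_family)
    if slice_start is None or slice_end is None:
        # The template makes the whole query string optional.
        return location

    params = [("slice_start", slice_start), ("slice_end", slice_end)]
    if reverse:
        params.append(("reversed", "true"))
    if limit is not None:
        params.append(("limit", str(limit)))

    # Assumption: values are plain strings that need no escaping.
    query = "&".join("%s=%s" % (name, value) for name, value in params)
    return location + "?" + query


if __name__ == "__main__":
    print(cassandra_location("Keyspace", "ColumnFamily", "a", "z", limit=4096))
```

The resulting string would replace the plain `cassandra://Keyspace/ColumnFamily` location used elsewhere in this thread.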
> On Mar 24, 2011, at 3:57 PM, Jeffrey Wang wrote:
>> It looks like this functionality is not in the 0.7.3 version of CassandraStorage.
>> I tried to add the constructor which takes the limit to the class, but I ran into some Pig
>> parsing errors, so I had to make the parameter a string. How did you get around this for the
>> version of CassandraStorage in trunk? I'm running Pig 0.8.0.
>> Also, when I bump the limit up very high (e.g. 1M columns), my Cassandra starts eating
>> up huge amounts of memory, maxing out my 16GB heap size. I suspect this is because of the
>> get_range_slices() call from ColumnFamilyRecordReader. Are there plans to make this streaming/paged?
>> -Jeffrey
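
The paging idea raised in this message can be sketched without a cluster. The Python below is an in-memory illustration, not the ColumnFamilyRecordReader implementation: `get_slice` is a hypothetical stand-in for a column slice query with an inclusive start bound, and `count_columns_paged` counts the columns of one wide row while holding at most one page in memory. Because the start bound is inclusive, every page after the first repeats the last column of the previous page, so that column is dropped.

```python
from bisect import bisect_left


def get_slice(columns, start, count):
    """Hypothetical stand-in for a slice query on one row.

    Returns up to `count` column names >= `start` from a sorted list.
    """
    first = bisect_left(columns, start)
    return columns[first:first + count]


def count_columns_paged(columns, page_size):
    """Count the columns of one wide row one page at a time."""
    if page_size < 2:
        # With an inclusive start, a page of 1 would never advance.
        raise ValueError("page_size must be at least 2")

    total = 0
    start = ""          # the empty string sorts before every column name
    first_page = True
    while True:
        page = get_slice(columns, start, page_size)
        if not first_page:
            page = page[1:]   # drop the column already counted
        if not page:
            break
        total += len(page)
        start = page[-1]      # the next page starts at the last column seen
        first_page = False
    return total


if __name__ == "__main__":
    wide_row = sorted("col%07d" % i for i in range(2500))
    print(count_columns_paged(wide_row, 1024))
```

The design point is that memory use is bounded by the page size rather than by the width of the row, which is what a single very large limit cannot offer.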
>> -----Original Message-----
>> From: Jeremy Hanna [] 
>> Sent: Thursday, March 24, 2011 11:34 AM
>> To:
>> Subject: Re: pig counting question
>> The limit defaults to 1024 but you can set it when you use CassandraStorage in pig, like so:
>> rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage(4096);
>> or whatever value you wish.
>> Give that a try and see if it gives you more of what you're looking for.
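
A small Python sketch of why a per-row column limit makes every wide row report the same count. Nothing here is CassandraStorage code: `load_row` is a hypothetical stand-in for a loader that returns at most `limit` columns per row, with the 1024 default taken from this thread, and the row data is made up for illustration.

```python
def load_row(columns, limit=1024):
    """Hypothetical loader: return at most `limit` columns of a row."""
    return columns[:limit]


def column_counts(rows, limit=1024):
    """Count the columns that survive the limit, per row key."""
    return {key: len(load_row(cols, limit)) for key, cols in rows.items()}


# Made-up rows with very different widths.
rows = {
    "row_a": ["c%d" % i for i in range(10)],
    "row_b": ["c%d" % i for i in range(3000)],
    "row_c": ["c%d" % i for i in range(50000)],
}

if __name__ == "__main__":
    print(column_counts(rows))        # wide rows are capped at the default
    print(column_counts(rows, 4096))  # a higher limit only moves the cap
```

Any row wider than the limit reports exactly the limit, so raising the limit moves the cap without removing it.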
>> On Mar 24, 2011, at 1:16 PM, Jeffrey Wang wrote:
>>> Hey all,
>>> I'm trying to run a very simple Pig script against my Cassandra cluster (5 nodes,
>>> 0.7.3). I've gotten it all set up and working, but the script is giving me some strange results.
>>> Here is my script:
>>> rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage();
>>> rowct = FOREACH rows GENERATE $0, COUNT($1);
>>> dump rowct;
>>> If I understand Pig correctly, this should output (row name, column count) tuples,
>>> but I'm always seeing 1024 for the column count even though the rows have a highly variable
>>> number of columns. Am I missing something? Thanks.
>>> -Jeffrey
