cassandra-user mailing list archives

From aaron morton <>
Subject Re: pig and widerows
Date Mon, 01 Oct 2012 21:08:18 GMT
That looks like it may be a bug, can you create a ticket on


Aaron Morton
Freelance Developer

On 28/09/2012, at 7:50 AM, William Oberman <> wrote:

> I don't want to switch my cassandra to HEAD, but looking at the newest code for CassandraStorage,
I'm concerned the Uri parsing for widerows isn't going to work.  setLocation first calls setLocationFromUri
(which sets widerows to the Uri value), but then sets widerows to a static value (which is
defined as false), and then it sets widerows to the system setting if it exists.  That doesn't
seem right...  ?
> But setLocationFromUri also gets called from setStoreLocation, and I don't really know
the difference between setLocation and setStoreLocation in terms of how cassandra, pig, and
hadoop are integrated.
> will
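
The ordering concern described above can be illustrated with a small, self-contained sketch. This is not the actual CassandraStorage code: the class and method names are hypothetical, and the choice that a URI value should win over the environment setting is an assumption made only for illustration.

```java
public class WideRowPrecedence {
    static final boolean DEFAULT_WIDEROW_INPUT = false;

    // Order as described in the message: the URI value is applied first,
    // then overwritten by the static default, then by the environment value.
    static boolean asDescribed(Boolean fromUri, String fromEnv) {
        boolean widerows = (fromUri != null) ? fromUri : DEFAULT_WIDEROW_INPUT;
        widerows = DEFAULT_WIDEROW_INPUT;            // discards the URI value
        if (fromEnv != null)
            widerows = Boolean.parseBoolean(fromEnv);
        return widerows;
    }

    // One possible alternative (an assumption): start from the default,
    // apply the environment value, and let an explicit URI value win last.
    static boolean uriWins(Boolean fromUri, String fromEnv) {
        boolean widerows = DEFAULT_WIDEROW_INPUT;
        if (fromEnv != null)
            widerows = Boolean.parseBoolean(fromEnv);
        if (fromUri != null)
            widerows = fromUri;
        return widerows;
    }

    public static void main(String[] args) {
        // "?widerows=true" in the URI, no environment variable set.
        System.out.println("asDescribed=" + asDescribed(true, null)
                + " uriWins=" + uriWins(true, null));
    }
}
```

With a URI value of true and no environment variable, the first method returns false because the default overwrites the parsed value, which is the behavior the message is worried about.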
> On Thu, Sep 27, 2012 at 3:26 PM, William Oberman <> wrote:
> The next painful lesson for me was figuring out how to get logging working for a distributed
hadoop process.   In my test environment, I have a single node that runs name/secondaryname/data/job
trackers (call it "central"), and I have two cassandra nodes running tasktrackers.  But, I
also have cassandra libraries on the central box, and invoke my pig script from there.   I
had been patching and recompiling cassandra (1.1.5 with my logging, and the system env fix)
on that central box, and SOME of the logging was appearing in the pig output.  But, eventually
I decided to move that recompiled code to the tasktracker boxes, and then I found even more
of the logging I had added in:
> /var/log/hadoop/userlogs/JOB_ID
> on each of the tasktrackers.
> Based on this new logging, I found out that the widerows setting wasn't propagating from
the central box to the tasktrackers.  I added:
> export PIG_WIDEROW_INPUT=true
> on each of the tasktrackers, and it finally worked!
> So, long story short, to actually get all columns for a key I had to:
> 1.) patch 1.1.5 to honor the "PIG_WIDEROW_INPUT=true" system setting
> 2.) add the system setting to ALL nodes in the hadoop cluster
> I'm going to try to undo all of my other hacks to get logging/printing working to confirm
if those were actually the only two changes I had to make.
> will
> On Thu, Sep 27, 2012 at 1:43 PM, William Oberman <> wrote:
> Ok, this is painful.  The first problem I found is in stock 1.1.5 there is no way to
set widerows to true!  The new widerows URI parsing is NOT in 1.1.5.  And for extra fun, getting
the value from the system property is BROKEN (at least in my centos linux environment).
> Here are the key lines of code (in CassandraStorage), note the different ways of getting
the property!  getenv in the test, and getProperty in the set:
>         widerows = DEFAULT_WIDEROW_INPUT;
>         if (System.getenv(PIG_WIDEROW_INPUT) != null)
>             widerows = Boolean.valueOf(System.getProperty(PIG_WIDEROW_INPUT));
> I added this logging:
>         logger.warn("widerows = " + widerows + " getenv=" + System.getenv(PIG_WIDEROW_INPUT) + " getProp=" + System.getProperty(PIG_WIDEROW_INPUT));
> And I saw:
> org.apache.cassandra.hadoop.pig.CassandraStorage - widerows = false getenv=true getProp=null
> So for me getProperty != getenv :-(
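
For readers following along, here is a minimal sketch of why the two lookups disagree. System.getenv reads the process environment, while System.getProperty reads JVM system properties (for example, those set with -D), so a variable that is only exported in the shell is invisible to getProperty. The class and method names below are hypothetical, the environment and properties are passed in as parameters so the example runs anywhere, and the "fixed" variant is only one plausible correction, not the actual Cassandra patch.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class WideRowSetting {
    static final String PIG_WIDEROW_INPUT = "PIG_WIDEROW_INPUT";
    static final boolean DEFAULT_WIDEROW_INPUT = false;

    // Mirrors the quoted logic: tests the environment, but reads the
    // system properties, which do not contain the exported variable.
    static boolean buggy(Map<String, String> env, Properties props) {
        boolean widerows = DEFAULT_WIDEROW_INPUT;
        if (env.get(PIG_WIDEROW_INPUT) != null)
            widerows = Boolean.valueOf(props.getProperty(PIG_WIDEROW_INPUT));
        return widerows;
    }

    // One possible fix: test and read the same source.
    static boolean fixed(Map<String, String> env) {
        String value = env.get(PIG_WIDEROW_INPUT);
        return (value != null) ? Boolean.parseBoolean(value) : DEFAULT_WIDEROW_INPUT;
    }

    public static void main(String[] args) {
        // Simulates "export PIG_WIDEROW_INPUT=true" with no -D property set.
        Map<String, String> env = Collections.singletonMap(PIG_WIDEROW_INPUT, "true");
        Properties props = new Properties();
        System.out.println("buggy=" + buggy(env, props) + " fixed=" + fixed(env));
    }
}
```

In real code the arguments would be System.getenv() and System.getProperties(). Because Boolean.valueOf of a null string is false, the buggy variant silently falls back to false, which matches the log line above.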
> For people trying to figure out how to debug cassandra + hadoop + pig, for me the key
to get debugging and logging working was to focus on /etc/hadoop/conf (not /etc/pig/conf as
I expected).  
> Also, if you want to compile your own cassandra (to add logging messages), make sure
it appears first on the pig classpath (use pig -secretDebugCmd to see the fully qualified
command line).
> The next thing I'm trying to figure out is why when widerows == true I'm STILL not seeing
more than 1024 columns :-( 
> will
> On Wed, Sep 26, 2012 at 3:42 PM, William Oberman <> wrote:
> Hi,
> I'm trying to figure out what's going on with my cassandra/hadoop/pig system.  I created
a "mini" copy of my main cassandra data by randomly subsampling to get ~50,000 keys.  I was
then writing pig scripts but also the equivalent operation using simple single threaded code
to double check pig.
> Of course my very first test failed.  After doing a pig DUMP on the raw data, what appears
to be happening is I'm only getting the first 1024 columns of a key.  After some googling,
this seems to be known behavior unless you add "?widerows=true" to the pig load URI. I tried
this, but it didn't seem to fix anything :-(   Here's the start of my pig script:
> foo = LOAD 'cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true' USING CassandraStorage()
AS (key:chararray, columns:bag {column:tuple (name, value)});
> I'm using cassandra 1.1.5 from datastax rpms.  I'm using hadoop (0.20.2+923.418-1) and
pig (0.8.1+28.39-1) from cloudera rpms.
> What am I doing wrong?  Or, how can I enable debugging/logging to figure out what
is going on?  I haven't had to debug hadoop+pig+cassandra much, other than doing DUMP/ILLUSTRATE
from pig.
> will
