cassandra-user mailing list archives

From Jonathan Ellis <>
Subject Re: Pig not reading all cassandra data
Date Thu, 17 Feb 2011 18:36:54 GMT
Thanks a lot for the help on this!

From what I can tell that looks like a good solution.  Created to make that

On Thu, Feb 17, 2011 at 11:52 AM, Matt Kennedy <> wrote:
> I have a resolution for how I'm dealing with this problem for my particular
> situation and I'd like to throw it out there to see if you think it should
> be integrated into the core Cassandra code.
> Just to repeat, the immediate workaround for this is to set
> -Dpig.splitCombination=false when you launch pig.
> However, we wanted to keep splitCombination on because it is a useful
> optimization for a lot of our use cases, so I went digging for the least
> intrusive way to keep the split combiner on, but also prevent it from
> combining splits that read from Cassandra.  My solution, which you are
> welcome to critique, is to change line 65 of
> such that it returns Long.MAX_VALUE instead of zero.
> That effectively turns off split combination in Pig 0.8 when reading from
> Cassandra, but leaves it on for everything else.  So far, I can't see any
> negative side effects from it.
> Thoughts?
> On Fri, Feb 11, 2011 at 3:37 PM, Matt Kennedy <> wrote:
>> Sorry it has taken me a while to get back to this.  I'm still trying to
>> get to the bottom of this to find where the disconnect is between the column
>> family input format code and the Pig optimizer.
>> I suspected that the problem was line 365 of:
>> ...but I changed the file so that it returns -1
>> instead of 0, the result of which is that the Pig job will iterate over the
>> entirety of the cassandra data that it is supposed to, but it does so with
>> only one mapper.  It looks like the Pig map combiner isn't using the
>> split.getLength call to determine how the maps get combined as I originally
>> suspected.  I'll update when I figure more out.
>> -Matt
>> On Sat, Feb 5, 2011 at 1:01 AM, Jonathan Ellis <> wrote:
>>> On Fri, Feb 4, 2011 at 9:47 PM, Matt Kennedy <> wrote:
>>> > Found the culprit.  There is a new feature in Pig 0.8 that will try to
>>> > reduce the number of splits used to speed up the whole job.  Since the
>>> > ColumnFamilyInputFormat lists the input size as zero, this feature
>>> > eliminates all of the splits except for one.
>>> >
>>> > The workaround is to disable this feature for jobs that use
>>> > CassandraStorage by setting -Dpig.splitCombination=false in the
>>> > pig_cassandra script.
>>> >
>>> > Hope somebody finds this useful, you wouldn't believe how many
>>> > dead-ends I ran down trying to figure this out.
>>> Ouch, thanks for tracking that down.
>>> What should CFIF be returning differently?  Do you mean the
>>> InputSplit.getLength?
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support

Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
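
To illustrate the behaviour discussed in this thread, here is a minimal sketch of a greedy split combiner. It is a toy model only and is not Pig's actual implementation: the class name SplitCombineSketch, the combine method, and the 128 MB threshold are all hypothetical. It only shows why splits that report a length of zero can collapse into a single group, while splits that report Long.MAX_VALUE each stay on their own.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitCombineSketch {

    // Toy greedy combiner (an assumption, not Pig's real algorithm):
    // pack consecutive splits into one group while the sum of their
    // reported lengths stays within maxCombinedSize.
    static List<List<Long>> combine(long[] reportedLengths, long maxCombinedSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;

        for (long len : reportedLengths) {
            if (len >= maxCombinedSize) {
                // A split at or above the threshold is never combined.
                List<Long> alone = new ArrayList<>();
                alone.add(len);
                groups.add(alone);
                continue;
            }
            // Written as a subtraction so the check cannot overflow.
            if (!current.isEmpty() && len > maxCombinedSize - currentSize) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(len);
            currentSize += len;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long threshold = 128L * 1024 * 1024; // hypothetical combine limit

        // Splits reporting length 0, as described for ColumnFamilyInputFormat.
        long[] zeroLength = {0L, 0L, 0L, 0L};
        // Splits reporting Long.MAX_VALUE, as in the proposed change.
        long[] maxLength = {Long.MAX_VALUE, Long.MAX_VALUE, Long.MAX_VALUE, Long.MAX_VALUE};

        System.out.println("zero-length groups: " + combine(zeroLength, threshold).size());
        System.out.println("max-length groups:  " + combine(maxLength, threshold).size());
    }
}
```

In this toy model the four zero-length splits end up in one group and the four Long.MAX_VALUE splits stay as four groups, which mirrors the effect reported above: one mapper in the first case, split combination effectively disabled for Cassandra input in the second.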
