incubator-cassandra-user mailing list archives

From Jonathan Ellis <jbel...@gmail.com>
Subject Re: Pig not reading all cassandra data
Date Thu, 17 Feb 2011 18:36:54 GMT
Thanks a lot for the help on this!

From what I can tell, that looks like a good solution.  Created
https://issues.apache.org/jira/browse/CASSANDRA-2184 to make that
change.
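
For readers following along, here is a toy sketch of why the length a
split reports matters to a size-based combiner. It is not Pig's or
Cassandra's actual code: the class SplitCombinationSketch, the greedy
combine() method, and the 128 MB limit are all assumptions made up for
illustration. It only models the behaviour described in the quoted
messages below, where splits reporting zero length get merged into one
and splits reporting Long.MAX_VALUE are left alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: a hypothetical greedy combiner, NOT Pig's real implementation.
class SplitCombinationSketch {

    /**
     * Packs consecutive splits into groups whose total reported length
     * stays within maxCombinedSize. Each group stands for one mapper.
     */
    static List<List<Long>> combine(List<Long> splitLengths, long maxCombinedSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentTotal = 0;
        for (long length : splitLengths) {
            // A split at or above the limit is never merged with others.
            if (length >= maxCombinedSize) {
                if (!current.isEmpty()) {
                    groups.add(current);
                    current = new ArrayList<>();
                    currentTotal = 0;
                }
                groups.add(new ArrayList<>(Arrays.asList(length)));
                continue;
            }
            // Written as a subtraction so the comparison cannot overflow
            // (assumes non-negative lengths).
            if (!current.isEmpty() && currentTotal > maxCombinedSize - length) {
                groups.add(current);
                current = new ArrayList<>();
                currentTotal = 0;
            }
            current.add(length);
            currentTotal += length;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long limit = 128L * 1024 * 1024; // hypothetical combine limit

        // Four splits that all report length 0 collapse into one group.
        List<Long> zeroLength = Arrays.asList(0L, 0L, 0L, 0L);
        System.out.println("length 0:         " + combine(zeroLength, limit).size() + " group(s)");

        // Four splits that report Long.MAX_VALUE each stay on their own.
        List<Long> maxLength = Arrays.asList(Long.MAX_VALUE, Long.MAX_VALUE,
                                             Long.MAX_VALUE, Long.MAX_VALUE);
        System.out.println("Long.MAX_VALUE:   " + combine(maxLength, limit).size() + " group(s)");
    }
}
```

In this model the zero-length case yields a single group and the
Long.MAX_VALUE case yields four, which matches the one-mapper symptom
and the proposed fix as reported in the thread.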

On Thu, Feb 17, 2011 at 11:52 AM, Matt Kennedy <stinkymatt@gmail.com> wrote:
> I have a resolution for how I'm dealing with this problem for my particular
> situation and I'd like to throw it out there to see if you think it should
> be integrated into the core Cassandra code.
>
> Just to repeat, the immediate workaround for this is to set
> -Dpig.splitCombination=false when you launch pig.
>
> However, we wanted to keep splitCombination on because it is a useful
> optimization for a lot of our use cases, so I went digging for the least
> intrusive way to keep the split combiner on, but also prevent it from
> combining splits that read from Cassandra.  My solution, which you are
> welcome to critique, is to change line 65 of
> http://svn.apache.org/viewvc/cassandra/trunk/src/java/org/apache/cassandra/hadoop/ColumnFamilySplit.java
> such that it returns Long.MAX_VALUE instead of zero.
>
> That effectively turns off split combination in Pig 0.8 when reading from
> Cassandra, but leaves it on for everything else.  So far, I can't see any
> negative side effects from it.
>
> Thoughts?
>
>
> On Fri, Feb 11, 2011 at 3:37 PM, Matt Kennedy <stinkymatt@gmail.com> wrote:
>>
>> Sorry it has taken me a while to get back to this.  I'm still trying to
>> get to the bottom of this to find where the disconnect is between the column
>> family input format code and the Pig optimizer.
>>
>> I suspected that the problem was line 365 of:
>>
>> http://svn.apache.org/viewvc/pig/tags/release-0.8.0/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java?view=markup
>>
>> ...but I changed the ColumnFamilySplit.java file so that it returns -1
>> instead of 0, the result of which is that the Pig job will iterate over the
>> entirety of the cassandra data that it is supposed to, but it does so with
>> only one mapper.  It looks like the Pig map combiner isn't using the
>> split.getLength call to determine how the maps get combined as I originally
>> suspected.  I'll update when I figure more out.
>>
>> -Matt
>>
>> On Sat, Feb 5, 2011 at 1:01 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>>>
>>> On Fri, Feb 4, 2011 at 9:47 PM, Matt Kennedy <stinkymatt@gmail.com>
>>> wrote:
>>> > Found the culprit.  There is a new feature in Pig 0.8 that will try to
>>> > reduce the number of splits used to speed up the whole job.  Since the
>>> > ColumnFamilyInputFormat lists the input size as zero, this feature
>>> > eliminates all of the splits except for one.
>>> >
>>> > The workaround is to disable this feature for jobs that use
>>> > CassandraStorage by setting -Dpig.splitCombination=false in the
>>> > pig_cassandra script.
>>> >
>>> > Hope somebody finds this useful; you wouldn't believe how many
>>> > dead ends I ran down trying to figure this out.
>>>
>>> Ouch, thanks for tracking that down.
>>>
>>> What should CFIF be returning differently?  Do you mean the
>>> InputSplit.getLength?
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
