I have a fix that works for my particular situation, and I'd like to throw it out there to see whether you think it should be integrated into the core Cassandra code.
Just to repeat, the immediate workaround for this is to set -Dpig.splitCombination=false when you launch pig.
However, we wanted to keep splitCombination on because it is a useful optimization for a lot of our use cases, so I went digging for the least intrusive way to keep the split combiner on, but also prevent it from combining splits that read from Cassandra. My solution, which you are welcome to critique, is to change line 65 of http://svn.apache.org/viewvc/cassandra/trunk/src/java/org/apache/cassandra/hadoop/ColumnFamilySplit.java such that it returns Long.MAX_VALUE instead of zero.
That effectively turns off split combination in Pig 0.8 when reading from Cassandra, but leaves it on for everything else. So far, I can't see any negative side effects from it.
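To illustrate why the reported split length matters, here is a minimal Java sketch. It is a toy model, not Pig's actual combiner and not the real ColumnFamilySplit class: the class name, the combine method, and the 128 MB threshold are all hypothetical, chosen only to show how a size-based combiner would treat splits that report zero versus Long.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitCombinationSketch {

    /**
     * Toy combiner (NOT Pig's real implementation): packs consecutive splits
     * into one group until the group's reported size would exceed
     * maxCombinedSize. Each returned group stands for one mapper's input.
     */
    static List<List<Long>> combine(List<Long> splitLengths, long maxCombinedSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long length : splitLengths) {
            // A split already at or above the limit is never merged with others.
            if (length >= maxCombinedSize) {
                if (!current.isEmpty()) {
                    groups.add(current);
                    current = new ArrayList<>();
                    currentSize = 0;
                }
                List<Long> alone = new ArrayList<>();
                alone.add(length);
                groups.add(alone);
                continue;
            }
            // Close the current group if this split would push it past the limit.
            if (!current.isEmpty() && currentSize > maxCombinedSize - length) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(length);
            currentSize += length;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long limit = 128L * 1024 * 1024; // hypothetical combine threshold

        // Four splits that all report length zero collapse into a single group.
        List<Long> zeroLength = Arrays.asList(0L, 0L, 0L, 0L);
        System.out.println("length 0         -> groups: " + combine(zeroLength, limit).size());

        // Four splits that report Long.MAX_VALUE each stay in their own group.
        List<Long> maxLength = Arrays.asList(
                Long.MAX_VALUE, Long.MAX_VALUE, Long.MAX_VALUE, Long.MAX_VALUE);
        System.out.println("Long.MAX_VALUE   -> groups: " + combine(maxLength, limit).size());
    }
}
```

Under these assumptions, the zero-length splits end up in one group (one mapper), while the Long.MAX_VALUE splits are left uncombined, which matches the behavior described above.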
Sorry it has taken me a while to get back to this. I'm still trying to get to the bottom of this to find where the disconnect is between the column family input format code and the Pig optimizer.
I suspected that the problem was line 365 of:
...but I changed the ColumnFamilySplit.java file so that it returns -1 instead of 0. The result is that the Pig job iterates over all of the Cassandra data it is supposed to, but it does so with only one mapper. It looks like the Pig split combiner isn't using the split.getLength call to determine how the splits get combined, as I originally suspected. I'll update when I figure out more.
-Matt

On Sat, Feb 5, 2011 at 1:01 AM, Jonathan Ellis <firstname.lastname@example.org> wrote:
On Fri, Feb 4, 2011 at 9:47 PM, Matt Kennedy <email@example.com> wrote:
Ouch, thanks for tracking that down.
> Found the culprit. There is a new feature in Pig 0.8 that will try to
> reduce the number of splits used to speed up the whole job. Since the
> ColumnFamilyInputFormat lists the input size as zero, this feature
> eliminates all of the splits except for one.
> The workaround is to disable this feature for jobs that use CassandraStorage
> by setting -Dpig.splitCombination=false in the pig_cassandra script.
> Hope somebody finds this useful, you wouldn't believe how many dead-ends I
> ran down trying to figure this out.
What should CFIF be returning differently? Do you mean the
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support