Sorry it has taken me a while to get back to this.  I'm still trying to find where the disconnect is between the column family input format code and the Pig optimizer.

I suspected that the problem was line 365 of:
http://svn.apache.org/viewvc/pig/tags/release-0.8.0/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java?view=markup

...but I changed the ColumnFamilySplit.java file so that it returns -1 instead of 0 for the length, and the result is the same: the Pig job iterates over all of the Cassandra data that it is supposed to, but it does so with only one mapper.  So it looks like the Pig map combiner isn't using the split.getLength call to determine how the maps get combined, contrary to what I originally suspected.  I'll update when I figure out more.
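
For anyone following along, here is a toy sketch of the kind of size-based combining I had in mind. This is not Pig's actual code: the class name, the greedy packing rule, and the 128 MB threshold are all my own assumptions for illustration. It packs consecutive splits into one group until the reported lengths add up to the threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Pig's implementation. All names here are hypothetical.
class SplitCombineSketch {

    // Greedily packs consecutive splits (identified by index) into groups.
    // A group is closed once its accumulated reported length reaches
    // maxCombinedSize; leftover splits form a final group.
    static List<List<Integer>> combine(long[] reportedLengths, long maxCombinedSize) {
        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentSize = 0;
        for (int i = 0; i < reportedLengths.length; i++) {
            current.add(i);
            currentSize += reportedLengths[i];
            if (currentSize >= maxCombinedSize) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long max = 128L * 1024 * 1024; // assumed threshold, for illustration only
        // Splits that report length 0 never reach the threshold.
        System.out.println(combine(new long[] {0, 0, 0, 0}, max).size());
        // Splits that report -1 never reach it either.
        System.out.println(combine(new long[] {-1, -1, -1, -1}, max).size());
        // Splits that report a realistic size stay separate.
        System.out.println(combine(new long[] {max, max, max, max}, max).size());
    }
}
```

In this toy model a length of -1 collapses to a single group just like 0 does, so if the real combiner works anything like this, my -1 experiment may not rule out getLength after all; trying a positive size estimate might be a better test.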

-Matt

On Sat, Feb 5, 2011 at 1:01 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
On Fri, Feb 4, 2011 at 9:47 PM, Matt Kennedy <stinkymatt@gmail.com> wrote:
> Found the culprit.  There is a new feature in Pig 0.8 that will try to
> reduce the number of splits used to speed up the whole job.  Since the
> ColumnFamilyInputFormat lists the input size as zero, this feature
> eliminates all of the splits except for one.
>
> The workaround is to disable this feature for jobs that use CassandraStorage
> by setting -Dpig.splitCombination=false in the pig_cassandra script.
>
> Hope somebody finds this useful, you wouldn't believe how many dead-ends I
> ran down trying to figure this out.

Ouch, thanks for tracking that down.

What should CFIF be returning differently?  Do you mean the
InputSplit.getLength?

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com