From: Matt Kennedy
To: user@cassandra.apache.org
Date: Thu, 17 Feb 2011 12:52:58 -0500
Subject: Re: Pig not reading all cassandra data

I have a resolution for how I'm dealing with this problem for my particular
situation and I'd like to throw it out there to see if you think it should be
integrated into the core Cassandra code.

Just to repeat, the immediate workaround for this is to set
-Dpig.splitCombination=false when you launch Pig.

However, we wanted to keep splitCombination on because it is a useful
optimization for a lot of our use cases, so I went digging for the least
intrusive way to keep the split combiner on, but also prevent it from
combining splits that read from Cassandra. My solution, which you are welcome
to critique, is to change line 65 of

http://svn.apache.org/viewvc/cassandra/trunk/src/java/org/apache/cassandra/hadoop/ColumnFamilySplit.java

such that it returns Long.MAX_VALUE instead of zero. That effectively turns
off split combination in Pig 0.8 when reading from Cassandra, but leaves it on
for everything else. So far, I can't see any negative side effects from it.

Thoughts?

On Fri, Feb 11, 2011 at 3:37 PM, Matt Kennedy <stinkymatt@gmail.com> wrote:

> Sorry it has taken me a while to get back to this. I'm still trying to get
> to the bottom of this to find where the disconnect is between the column
> family input format code and the Pig optimizer.
>
> I suspected that the problem was line 365 of:
>
> http://svn.apache.org/viewvc/pig/tags/release-0.8.0/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java?view=markup
>
> ...but I changed the ColumnFamilySplit.java file so that it returns -1
> instead of 0, the result of which is that the Pig job will iterate over the
> entirety of the cassandra data that it is supposed to, but it does so with
> only one mapper. It looks like the Pig map combiner isn't using the
> split.getLength call to determine how the maps get combined as I originally
> suspected. I'll update when I figure more out.
>
> -Matt
>
>
> On Sat, Feb 5, 2011 at 1:01 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>
>> On Fri, Feb 4, 2011 at 9:47 PM, Matt Kennedy <stinkymatt@gmail.com> wrote:
>> > Found the culprit. There is a new feature in Pig 0.8 that will try to
>> > reduce the number of splits used to speed up the whole job. Since the
>> > ColumnFamilyInputFormat lists the input size as zero, this feature
>> > eliminates all of the splits except for one.
>> >
>> > The workaround is to disable this feature for jobs that use
>> > CassandraStorage by setting -Dpig.splitCombination=false in the
>> > pig_cassandra script.
>> >
>> > Hope somebody finds this useful, you wouldn't believe how many
>> > dead-ends I ran down trying to figure this out.
>>
>> Ouch, thanks for tracking that down.
>>
>> What should CFIF be returning differently? Do you mean the
>> InputSplit.getLength?
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
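
To make the reasoning at the top of this message concrete, here is a toy Java model of why the length a split reports could matter to a size-based combiner. This is only a sketch under stated assumptions: it is not Pig's actual combiner code and not the real ColumnFamilySplit class. The class name SplitCombinerSketch, the combine method, and the 128 MB threshold are all made up for illustration. The assumed rule is that splits are packed together until a size limit is reached, and a split that already reports a length at or above the limit is left on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Pig's combiner and NOT Cassandra's ColumnFamilySplit.
class SplitCombinerSketch {

    /**
     * Hypothetical greedy combiner. Packs consecutive splits into one group
     * until adding the next one would exceed maxCombinedSize. A split whose
     * reported length is already >= maxCombinedSize gets a group of its own.
     * Returns groups of split indexes; each group stands for one mapper.
     */
    static List<List<Integer>> combine(long[] reportedLengths, long maxCombinedSize) {
        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentSize = 0;
        for (int i = 0; i < reportedLengths.length; i++) {
            long len = reportedLengths[i];
            if (len >= maxCombinedSize) {
                // Too large to combine with anything: emit it alone.
                List<Integer> alone = new ArrayList<>();
                alone.add(i);
                groups.add(alone);
                continue;
            }
            // Written as a subtraction so the check itself cannot overflow.
            if (!current.isEmpty() && currentSize > maxCombinedSize - len) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(i);
            currentSize += len;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long max = 128L * 1024 * 1024; // assumed threshold, illustrative only

        // Every split reports length 0: they all fit into a single group.
        long[] zeros = {0L, 0L, 0L, 0L};
        System.out.println(combine(zeros, max).size()); // prints 1

        // Every split reports Long.MAX_VALUE: none of them can be combined.
        long[] huge = {Long.MAX_VALUE, Long.MAX_VALUE, Long.MAX_VALUE, Long.MAX_VALUE};
        System.out.println(combine(huge, max).size()); // prints 4
    }
}
```

In this model, splits that report a length of zero all collapse into one group, while splits that report Long.MAX_VALUE each keep their own group, which matches the behaviour described above for zero versus Long.MAX_VALUE. Whether Pig 0.8 follows this exact rule should be checked against MapRedUtil.java rather than taken from this sketch.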