Date: Thu, 17 Feb 2011 12:36:54 -0600
Subject: Re: Pig not reading all cassandra data
From: Jonathan Ellis
To: user

Thanks a lot for the help on this!

From what I can tell that looks like a good solution. Created
https://issues.apache.org/jira/browse/CASSANDRA-2184 to make that
change.

On Thu, Feb 17, 2011 at 11:52 AM, Matt Kennedy wrote:
> I have a resolution for how I'm dealing with this problem for my particular
> situation and I'd like to throw it out there to see if you think it should
> be integrated into the core Cassandra code.
>
> Just to repeat, the immediate workaround for this is to set
> -Dpig.splitCombination=false when you launch pig.
>
> However, we wanted to keep splitCombination on because it is a useful
> optimization for a lot of our use cases, so I went digging for the least
> intrusive way to keep the split combiner on, but also prevent it from
> combining splits that read from Cassandra. My solution, which you are
> welcome to critique, is to change line 65 of
> http://svn.apache.org/viewvc/cassandra/trunk/src/java/org/apache/cassandra/hadoop/ColumnFamilySplit.java
> such that it returns Long.MAX_VALUE instead of zero.
>
> That effectively turns off split combination in Pig 0.8 when reading from
> Cassandra, but leaves it on for everything else. So far, I can't see any
> negative side effects from it.
>
> Thoughts?
>
>
> On Fri, Feb 11, 2011 at 3:37 PM, Matt Kennedy wrote:
>>
>> Sorry it has taken me a while to get back to this. I'm still trying to
>> get to the bottom of this to find where the disconnect is between the column
>> family input format code and the Pig optimizer.
>>
>> I suspected that the problem was line 365 of:
>>
>> http://svn.apache.org/viewvc/pig/tags/release-0.8.0/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java?view=markup
>>
>> ...but I changed the ColumnFamilySplit.java file so that it returns -1
>> instead of 0, the result of which is that the Pig job will iterate over the
>> entirety of the cassandra data that it is supposed to, but it does so with
>> only one mapper. It looks like the Pig map combiner isn't using the
>> split.getLength call to determine how the maps get combined as I originally
>> suspected. I'll update when I figure more out.
>>
>> -Matt
>>
>> On Sat, Feb 5, 2011 at 1:01 AM, Jonathan Ellis wrote:
>>>
>>> On Fri, Feb 4, 2011 at 9:47 PM, Matt Kennedy wrote:
>>> > Found the culprit. There is a new feature in Pig 0.8 that will try to
>>> > reduce the number of splits used to speed up the whole job. Since the
>>> > ColumnFamilyInputFormat lists the input size as zero, this feature
>>> > eliminates all of the splits except for one.
>>> >
>>> > The workaround is to disable this feature for jobs that use
>>> > CassandraStorage by setting -Dpig.splitCombination=false in the
>>> > pig_cassandra script.
>>> >
>>> > Hope somebody finds this useful, you wouldn't believe how many
>>> > dead-ends I ran down trying to figure this out.
>>>
>>> Ouch, thanks for tracking that down.
>>>
>>> What should CFIF be returning differently? Do you mean the
>>> InputSplit.getLength?
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>>
>
>

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
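
For readers of the archive, here is a rough sketch of why the length a split
reports could matter to a size-based combiner. It is a toy model, not Pig's
actual code: the class SplitCombineSketch, the method combine, and the 128 MB
threshold are all hypothetical and chosen only for illustration. It assumes a
combiner that greedily packs splits by reported length and leaves any split at
or above the threshold in its own group. It does not reproduce the missing-data
symptom described in the thread; it only illustrates how a Long.MAX_VALUE
length would keep each split separate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: NOT Pig's actual implementation. All names are hypothetical.
final class SplitCombineSketch {

    /** Greedily packs reported split lengths into groups of at most maxCombinedSize bytes. */
    static List<List<Long>> combine(List<Long> lengths, long maxCombinedSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long length : lengths) {
            if (length >= maxCombinedSize) {
                // Assumed rule: a split that is already large keeps its own group.
                List<Long> alone = new ArrayList<>();
                alone.add(length);
                groups.add(alone);
                continue;
            }
            // Written as a subtraction so the comparison cannot overflow.
            if (!current.isEmpty() && currentSize > maxCombinedSize - length) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(length);
            currentSize += length;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long maxCombinedSize = 128L * 1024 * 1024; // assumed threshold for this sketch

        // Four splits that all report a length of zero collapse into one group.
        List<Long> zeros = Arrays.asList(0L, 0L, 0L, 0L);
        System.out.println(combine(zeros, maxCombinedSize).size()); // prints 1

        // Four splits that report Long.MAX_VALUE each stay in their own group.
        List<Long> huge = Arrays.asList(Long.MAX_VALUE, Long.MAX_VALUE,
                                        Long.MAX_VALUE, Long.MAX_VALUE);
        System.out.println(combine(huge, maxCombinedSize).size()); // prints 4
    }
}
```

Under this model a reported length of zero makes every split look free to
merge, while a very large reported length opts the split out of merging without
touching the combiner setting for other inputs.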