From: Nithin Asokan
Date: Thu, 08 Oct 2015 21:48:44 +0000
Subject: Re: SparkPipeline possible avro reuse on cache()
To: user@crunch.apache.org

Thanks Josh. I logged https://issues.apache.org/jira/browse/CRUNCH-569 and
will try submitting a patch for this.

On Thu, Oct 8, 2015 at 1:51 PM Josh Wills wrote:

> Yeah, I could see how that would happen. I think the move would be to
> inject a deep copy inside of the RDD that is underneath a cached
> PCollection. I can probably take a crack at a patch later this weekend; I
> have a busy couple of days w/the baby and new job and whatnot. :)
>
> J
>
> On Thu, Oct 8, 2015 at 9:32 AM, Nithin Asokan wrote:
>
>> First I would like to thank everyone for the quick responses and fixes on
>> most issues. Great job everyone!
>>
>> I noticed that using cache() on a PTable built with SparkPipeline seems to
>> reuse objects for downstream DoFns. Here is an example that exhibits this
>> behavior:
>>
>> https://gist.github.com/nasokan/531b4ff9bf827d0835ab
>>
>> I would expect the output of this program to be a Pair with the same key
>> and value. However, it produces a Pair with a different key and value. I
>> have tested this with a text file input source and it works as expected.
>> Removing cache() also produces the expected result. So I suspect this
>> issue is specific to Avro and cache().
>>
>> Any thoughts on this behavior?
>>
>> Thank you!
>> Nithin
>>
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
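For readers of the archive, here is a minimal sketch of the kind of object-reuse hazard discussed in this thread, and of why a deep copy before caching would avoid it. It is plain Java and does not use Crunch, Spark, or Avro; `ReuseDemo`, `Rec`, `cacheByReference`, and `cacheWithCopy` are hypothetical names made up for illustration. It also assumes the cause is what Josh suggests (a reader that overwrites one shared instance while the cache keeps references to it), which this thread does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

class ReuseDemo {
    /** Hypothetical mutable record, standing in for a reused deserialized object. */
    static final class Rec {
        int value;

        Rec(int value) {
            this.value = value;
        }

        /** Deep copy; trivial here because the record holds a single primitive. */
        Rec copy() {
            return new Rec(value);
        }
    }

    /** Caches references to one shared instance that is overwritten per input. */
    static List<Integer> cacheByReference(int[] inputs) {
        Rec shared = new Rec(0);
        List<Rec> cache = new ArrayList<>();
        for (int v : inputs) {
            shared.value = v;   // the "reader" overwrites the same object
            cache.add(shared);  // the cache stores an alias, not a snapshot
        }
        return values(cache);
    }

    /** Same loop, but a copy is taken before the record enters the cache. */
    static List<Integer> cacheWithCopy(int[] inputs) {
        Rec shared = new Rec(0);
        List<Rec> cache = new ArrayList<>();
        for (int v : inputs) {
            shared.value = v;
            cache.add(shared.copy());  // snapshot survives later overwrites
        }
        return values(cache);
    }

    /** Reads back what the cached records contain after the loop has finished. */
    private static List<Integer> values(List<Rec> cache) {
        List<Integer> out = new ArrayList<>();
        for (Rec r : cache) {
            out.add(r.value);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] inputs = {1, 2, 3};
        System.out.println(cacheByReference(inputs)); // prints [3, 3, 3]
        System.out.println(cacheWithCopy(inputs));    // prints [1, 2, 3]
    }
}
```

In the first method every cached entry ends up showing the last value read, which is the same flavor of surprise as a downstream function seeing a key and value it did not expect. Whether the actual fix for CRUNCH-569 takes this shape is not stated in this thread.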