From: Tim van Heugten <stimme@gmail.com>
Date: Wed, 6 Feb 2013 14:15:47 +0100
Subject: Re: MemPipeline and context
To: crunch-user@incubator.apache.org

Usually just to alter the default execution plan.

In this case, where the crunch bug counteracted our own bug, we used it to end
up with the desired (*) output (in fact triggering the crunch bug). We have now
fixed our bug and are no longer pursuing the crunch bug.

In general I would not expect the output to depend on the execution plan.

Cheers,

Tim

*) Here we had a discrepancy between desired and (technically) correct output.


On Wed, Feb 6, 2013 at 2:07 PM, Gabriel Reid <gabriel.reid@gmail.com> wrote:

> Thanks for all the info Tim. I've posted a bit more information on
> CRUNCH-163, and will look into it more this evening.
>
> About calling materialize within pipelines, just to clarify: are you doing
> this both to get a more efficient execution (i.e. alter the default
> execution plan) and to get the correct output, or just one of those two?
>
> Thanks,
>
> Gabriel
>
>
> On Wed, Feb 6, 2013 at 11:53 AM, Tim van Heugten wrote:
>
>> To summarize:
>> - When we saw data duplication, that was what we should have been
>> expecting, given our implementation. That is not the issue.
>> - Sometimes we didn't see data duplication. That is an issue:
>> *Union sometimes ignores one of the input branches.*
>>
>> I created https://issues.apache.org/jira/browse/CRUNCH-163 for this
>> issue. The tests singleUnion and doubleUnionWithoutMaterializeInbetween
>> pass in my environment (0.4), the others fail.
>> Besides breaking a union by adding a materialize after it, I could also
>> break it by performing a parallelDo after it or by just joining two read
>> pCollections.
>>
>> Cheers,
>>
>> Tim
>>
>>
>> On Tue, Feb 5, 2013 at 3:38 PM, Tim van Heugten wrote:
>>
>>> Hmmm,
>>>
>>> So we had a mistake in our code that emitted the data in both branches
>>> before union2.
>>> *And*, the crunch union also *failed to merge the data* in some
>>> circumstances. My side remark about not seeing the join happen was
>>> actually bang on.. :-/
>>>
>>> So the question now becomes: when does a union ignore one of its
>>> incoming branches?
>>> Apparently with materialization in the right spots we can force the
>>> correct pipeline(*).
>>>
>>> Cheers,
>>>
>>> Tim van Heugten
>>>
>>>
>>> *) Thereby exposing our bug, seemingly data duplication. But just to be
>>> clear, this is actually the *correct* behavior.
>>>
>>>
>>> On Tue, Feb 5, 2013 at 3:18 PM, Tim van Heugten wrote:
>>>
>>>> Hi,
>>>>
>>>> It turns out the data in the two branches that are unioned in union2 is
>>>> not mutually exclusive (counter to what I was expecting). Probably we
>>>> should expect data duplication.
>>>>
>>>> However, this still does not explain why sometimes we find data
>>>> duplication and sometimes we don't.
>>>>
>>>> Will keep you posted,
>>>>
>>>> Tim
>>>>
>>>>
>>>> On Tue, Feb 5, 2013 at 11:32 AM, Tim van Heugten wrote:
>>>>
>>>>> Hi Gabriel,
>>>>>
>>>>> I've been unsuccessful so far in reproducing the issue in a controlled
>>>>> environment. As said, it's fragile; maybe the types involved play a
>>>>> role, so when I tried to simplify those I broke the failure condition.
>>>>> I decided it's time to try providing more information without giving
>>>>> an explicit example.
>>>>>
>>>>> The pipeline we build is illustrated here: http://yuml.me/8ef99512.
>>>>> Depending on where we materialize, the data occurs twice in UP.
>>>>> The EITPI job filters the exact opposite of the filter branch. In PWR
>>>>> only data from EITPI is passed through, while the PITP data is used to
>>>>> modify it.
>>>>> Below you find the job names as executed when data duplication occurs;
>>>>> materializations occur before BTO(*) and after UP.
>>>>>
>>>>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch655004156/p4)"
>>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch655004156/p2)"
>>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch655004156/p8)"
>>>>> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch655004156/p8)]]+GBK+UP+Avro(/tmp/crunch655004156/p6)"
>>>>> "Avro(/tmp/crunch655004156/p6)+GetData+Avro(/tmp/crunch655004156/p10)"
>>>>> "Avro(/tmp/crunch655004156/p6)+GetTraces+Avro(target/trace-dump/traces)"
>>>>>
>>>>> Here are the jobs performed when materialization is added between BTO
>>>>> and gbk:
>>>>>
>>>>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch-551174870/p4)"
>>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch-551174870/p2)"
>>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch-551174870/p6)"
>>>>> "Avro(/tmp/crunch-551174870/p6)+GBK+UP+Avro(/tmp/crunch-551174870/p8)"
>>>>> "Avro(/tmp/crunch-551174870/p8)+GetData+Avro(/tmp/crunch-551174870/p10)"
>>>>> "Avro(/tmp/crunch-551174870/p8)+GetTraces+Avro(target/trace-dump/traces)"
>>>>>
>>>>> Without changing anything else, the added materialization fixes the
>>>>> issue of data duplication.
>>>>>
>>>>> If you have any clues as to how I can extract a clean working example,
>>>>> I'm happy to hear them.
>>>>>
>>>>>
>>>>> *) This materialization probably explains the second job; however,
>>>>> where the filtered data is joined is lost on me. This is not the cause,
>>>>> though: with just one materialize at the end, after UP, the data count
>>>>> still doubled. The jobs then look like this:
>>>>>
>>>>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch369510677/p4)"
>>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch369510677/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch369510677/p6)"
>>>>> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch369510677/p6)]]+GBK+UP+Avro(/tmp/crunch369510677/p2)"
>>>>> "Avro(/tmp/crunch369510677/p2)+GetTraces+Avro(target/trace-dump/traces)"
>>>>> "Avro(/tmp/crunch369510677/p2)+GetData+Avro(/tmp/crunch369510677/p8)"
>>>>>
>>>>> BR,
>>>>>
>>>>> Tim van Heugten
>>>>>
>>>>>
>>>>> On Thu, Jan 31, 2013 at 9:27 PM, Gabriel Reid wrote:
>>>>>
>>>>>> Hi Tim,
>>>>>>
>>>>>> On 31 Jan 2013, at 10:45, Tim van Heugten <stimme@gmail.com> wrote:
>>>>>>
>>>>>> > Hi Gabriel,
>>>>>> >
>>>>>> > For the most part it is similar to what was sent around recently on
>>>>>> > this mailing list, see:
>>>>>> > From  Dave Beech
>>>>>> > Subject  Question about mapreduce job planner
>>>>>> > Date  Tue, 15 Jan 2013 11:41:42 GMT
>>>>>> >
>>>>>> > So, the common path before the branch to multiple outputs is
>>>>>> > executed twice. Sometimes the issues seem related to unions though,
>>>>>> > i.e. multiple inputs. We seem to have been troubled by a grouped
>>>>>> > table parallelDo on a table-union-gbk that got its data twice (all
>>>>>> > groups doubled in size). Inserting a materialize between the union
>>>>>> > and groupByKey solved the issue.
>>>>>> >
>>>>>> > These issues seem very fragile (so they're fixed easily by changing
>>>>>> > something that's irrelevant to the output), so usually we just add
>>>>>> > or remove a materialization to make it run again.
>>>>>> > I'll see if I can cleanly reproduce the data duplication issue
>>>>>> > later this week.
>>>>>>
>>>>>> Ok, that would be great if you could replicate it in a small test,
>>>>>> thanks!
>>>>>>
>>>>>> - Gabriel
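
A minimal sketch of the union / materialize / groupByKey workaround discussed
above, assuming the org.apache.crunch Java API of that era. The class name,
input paths, and key function are illustrative only and are not taken from the
pipeline described in this thread.

import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.PGroupedTable;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.types.writable.Writables;

public class UnionMaterializeSketch {

  // Illustrative key extraction: key each tab-separated line by its first field.
  static class FirstFieldKeyFn extends MapFn<String, String> {
    @Override
    public String map(String line) {
      return line.split("\t", 2)[0];
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(UnionMaterializeSketch.class);

    // Two independently read inputs, as in the "two read pCollections" case.
    PCollection<String> left = pipeline.read(From.textFile("in/left"));
    PCollection<String> right = pipeline.read(From.textFile("in/right"));

    // The union whose input branches the planner sometimes appeared to drop.
    PCollection<String> unioned = left.union(right);

    // Workaround described in the thread: force a materialization between the
    // union and the groupByKey, so the union output is written out before the
    // grouping job is planned.
    unioned.materialize();

    // Key and group the unioned data, standing in for the table-union-gbk
    // structure of the real pipeline.
    PTable<String, String> keyed = unioned.by(new FirstFieldKeyFn(), Writables.strings());
    PGroupedTable<String, String> grouped = keyed.groupByKey();

    // Write the grouped output (ungrouped back to a PTable) and run.
    pipeline.writeTextFile(grouped.ungroup(), "out/grouped");
    pipeline.done();
  }
}

Removing the unioned.materialize() call restores the plan shape where the union
feeds the groupByKey directly, which is the configuration the thread associates
with a dropped branch; whether this small example actually reproduces CRUNCH-163
is not established here.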