From: Tim van Heugten <stimme@gmail.com>
Date: Wed, 6 Feb 2013 14:15:47 +0100
Subject: Re: MemPipeline and context
To: crunch-user@incubator.apache.org

Usually just to alter the default execution plan.

In this case, where the crunch bug counteracted our own bug, we used it to end
up with the desired (*) output (in fact triggering the crunch bug). We have now
fixed our bug and are no longer pursuing the crunch bug.

In general I would not expect the output to depend on the execution plan.

Cheers,

Tim

*) Here we had a discrepancy between desired and (technically) correct output.


On Wed, Feb 6, 2013 at 2:07 PM, Gabriel Reid <gabriel.reid@gmail.com> wrote:

> Thanks for all the info Tim. I've posted a bit more information on
> CRUNCH-163, and will look into it more this evening.
>
> About calling materialize within pipelines, just to clarify: are you doing
> this both to get a more efficient execution (i.e. alter the default
> execution plan) and to get the correct output, or just one of those two?
>
> Thanks,
>
> Gabriel
>
>
> On Wed, Feb 6, 2013 at 11:53 AM, Tim van Heugten wrote:
>
>> To summarize:
>> - When we saw data duplication, that was what we should have been
>> expecting, given our implementation. That is not the issue.
>> - Sometimes we didn't see data duplication. That is an issue:
>> *Union sometimes ignores one of the input branches.*
>>
>> I created https://issues.apache.org/jira/browse/CRUNCH-163 for this
>> issue. The tests singleUnion and doubleUnionWithoutMaterializeInbetween
>> pass in my environment (0.4), the others fail.
>> Besides breaking a union by adding a materialize after it, I could also
>> break it by performing a parallelDo after it or by just joining two read
>> pCollections.
>>
>> Cheers,
>>
>> Tim
>>
>>
>> On Tue, Feb 5, 2013 at 3:38 PM, Tim van Heugten wrote:
>>
>>> Hmmm,
>>>
>>> So we had a mistake in our code that emitted the data in both branches
>>> before union2.
>>> *And*, the crunch union also *failed to merge the data* in some
>>> circumstances. My side remark about not seeing the join happen was
>>> actually bang on.. :-/
>>>
>>> So the question now becomes: when does a union ignore one of its
>>> incoming branches?
>>> Apparently with materialization in the right spots we can force the
>>> correct pipeline(*).
>>>
>>> Cheers,
>>>
>>> Tim van Heugten
>>>
>>>
>>> *) Thereby exposing our bug, seemingly data duplication. But just to be
>>> clear, this is actually the *correct* behavior.
>>>
>>>
>>> On Tue, Feb 5, 2013 at 3:18 PM, Tim van Heugten wrote:
>>>
>>>> Hi,
>>>>
>>>> It turns out the data in the two branches that are unioned in union2 is
>>>> not mutually exclusive (counter to what I was expecting). Probably we
>>>> should expect data duplication.
>>>>
>>>> However, this still does not explain why sometimes we find data
>>>> duplication and sometimes we don't.
>>>>
>>>> Will keep you posted,
>>>>
>>>> Tim
>>>>
>>>>
>>>> On Tue, Feb 5, 2013 at 11:32 AM, Tim van Heugten wrote:
>>>>
>>>>> Hi Gabriel,
>>>>>
>>>>> I've been unsuccessful so far in reproducing the issue in a controlled
>>>>> environment. As said, it's fragile; maybe the types involved play a
>>>>> role, so when I tried to simplify those I broke the failure condition.
>>>>> I decided it's time to try providing more information without giving
>>>>> an explicit example.
>>>>>
>>>>> The pipeline we build is illustrated here: http://yuml.me/8ef99512.
>>>>> Depending on where we materialize, the data occurs twice in UP.
>>>>> The EITPI job filters the exact opposite of the filter branch. In PWR
>>>>> only data from EITPI is passed through, while the PITP data is used to
>>>>> modify it.
>>>>> Below you find the job names as executed when data duplication occurs;
>>>>> materializations occur before BTO(*) and after UP.
>>>>>
>>>>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch655004156/p4)"
>>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch655004156/p2)"
>>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch655004156/p8)"
>>>>> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch655004156/p8)]]+GBK+UP+Avro(/tmp/crunch655004156/p6)"
>>>>> "Avro(/tmp/crunch655004156/p6)+GetData+Avro(/tmp/crunch655004156/p10)"
>>>>> "Avro(/tmp/crunch655004156/p6)+GetTraces+Avro(target/trace-dump/traces)"
>>>>>
>>>>> Here are the jobs performed when materialization is added between BTO
>>>>> and gbk:
>>>>>
>>>>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch-551174870/p4)"
>>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch-551174870/p2)"
>>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch-551174870/p6)"
>>>>> "Avro(/tmp/crunch-551174870/p6)+GBK+UP+Avro(/tmp/crunch-551174870/p8)"
>>>>> "Avro(/tmp/crunch-551174870/p8)+GetData+Avro(/tmp/crunch-551174870/p10)"
>>>>> "Avro(/tmp/crunch-551174870/p8)+GetTraces+Avro(target/trace-dump/traces)"
>>>>>
>>>>> Without changing anything else, the added materialization fixes the
>>>>> issue of data duplication.
>>>>>
>>>>> If you have any clues as to how I can extract a clean working example,
>>>>> I'm happy to hear them.
>>>>>
>>>>>
>>>>> *) This materialization probably explains the second job; however,
>>>>> where the filtered data is joined is lost on me. This is not the cause,
>>>>> though: with just one materialize at the end, after UP, the data count
>>>>> still doubled. The jobs then look like this:
>>>>>
>>>>> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch369510677/p4)"
>>>>> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch369510677/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch369510677/p6)"
>>>>> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch369510677/p6)]]+GBK+UP+Avro(/tmp/crunch369510677/p2)"
>>>>> "Avro(/tmp/crunch369510677/p2)+GetTraces+Avro(target/trace-dump/traces)"
>>>>> "Avro(/tmp/crunch369510677/p2)+GetData+Avro(/tmp/crunch369510677/p8)"
>>>>>
>>>>> BR,
>>>>>
>>>>> Tim van Heugten
>>>>>
>>>>>
>>>>> On Thu, Jan 31, 2013 at 9:27 PM, Gabriel Reid wrote:
>>>>>
>>>>>> Hi Tim,
>>>>>>
>>>>>> On 31 Jan 2013, at 10:45, Tim van Heugten <stimme@gmail.com> wrote:
>>>>>>
>>>>>> > Hi Gabriel,
>>>>>> >
>>>>>> > For the most part it is similar to what was sent around recently on
>>>>>> > this mailing list, see:
>>>>>> > From  Dave Beech
>>>>>> > Subject  Question about mapreduce job planner
>>>>>> > Date  Tue, 15 Jan 2013 11:41:42 GMT
>>>>>> >
>>>>>> > So, the common path before the branch to multiple outputs is
>>>>>> > executed twice. Sometimes the issues seem related to unions though,
>>>>>> > i.e. multiple inputs. We seem to have been troubled by a grouped
>>>>>> > table parallelDo on a table-union-gbk that got its data twice (all
>>>>>> > groups doubled in size). Inserting a materialize between the union
>>>>>> > and groupByKey solved the issue.
>>>>>> >
>>>>>> > These issues seem very fragile (so they're fixed easily by changing
>>>>>> > something that's irrelevant to the output), so usually we just add
>>>>>> > or remove a materialization to make it run again.
>>>>>> > I'll see if I can cleanly reproduce the data duplication issue
>>>>>> > later this week.
>>>>>>
>>>>>> Ok, that would be great if you could replicate it in a small test,
>>>>>> thanks!
>>>>>>
>>>>>> - Gabriel
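
A minimal sketch of the union / materialize / groupByKey workaround discussed
above, assuming the org.apache.crunch Java API of that era. The class name,
input paths, and key function are illustrative only and are not taken from the
pipeline described in this thread.

import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.PGroupedTable;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.types.writable.Writables;

public class UnionMaterializeSketch {

  // Illustrative key extraction: key each tab-separated line by its first field.
  static class FirstFieldKeyFn extends MapFn<String, String> {
    @Override
    public String map(String line) {
      return line.split("\t", 2)[0];
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(UnionMaterializeSketch.class);

    // Two independently read inputs, as in the "two read pCollections" case.
    PCollection<String> left = pipeline.read(From.textFile("in/left"));
    PCollection<String> right = pipeline.read(From.textFile("in/right"));

    // The union whose input branches the planner sometimes appeared to drop.
    PCollection<String> unioned = left.union(right);

    // Workaround described in the thread: force a materialization between the
    // union and the groupByKey, so the union output is written out before the
    // grouping job is planned.
    unioned.materialize();

    // Key and group the unioned data, standing in for the table-union-gbk
    // structure of the real pipeline.
    PTable<String, String> keyed = unioned.by(new FirstFieldKeyFn(), Writables.strings());
    PGroupedTable<String, String> grouped = keyed.groupByKey();

    // Write the grouped output (ungrouped back to a PTable) and run.
    pipeline.writeTextFile(grouped.ungroup(), "out/grouped");
    pipeline.done();
  }
}

Removing the unioned.materialize() call restores the plan shape where the union
feeds the groupByKey directly, which is the configuration the thread associates
with a dropped branch; whether this small example actually reproduces CRUNCH-163
is not established here.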