Mailing-List: contact user-help@crunch.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@crunch.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CABbqrY=PZ2Cj=+sOHmfwZM=T0P=TNSFTE8iFJAXwfHE5GfMN_Q@mail.gmail.com>
References: 
 <CABbqrYmL66KypDqjMP_Jgy=+MdAQAGFp3Z9rWEFQ1pHQy5XY6A@mail.gmail.com>
	<CAC79LcZUUSi1RM_Y-hin59OLgACsznJdvojQ6022MsfkKMaDog@mail.gmail.com>
	<CABbqrYkHCrb2-JbKFYX-eiLURt8e1cH-=U6MeM4nA70LDn+tYw@mail.gmail.com>
	<CABbqrY=PZ2Cj=+sOHmfwZM=T0P=TNSFTE8iFJAXwfHE5GfMN_Q@mail.gmail.com>
Date: Fri, 23 Oct 2015 10:09:26 -0700
Message-ID: 
 <CABc3QxFgdusyZ6tRPDoYEDMxVoz6b7DuvyRst_+w+WaYOgYcCQ@mail.gmail.com>
Subject: Re: Reuse PCollection / fork processing
From: Everett Anderson <everett@nuna.com>
To: user@crunch.apache.org
Content-Type: multipart/alternative; boundary=047d7b6250a8839a920522c8ac9b

--047d7b6250a8839a920522c8ac9b
Content-Type: text/plain; charset=UTF-8

Hi Rushi,

What's happening inside your filter() method? What's the boolean flag? Is
it calling pipeline.run()?

Crunch plans its work lazily. It seems like unless you call pipeline.run()
or pipeline.done(), Crunch won't actually perform any work or write the
tables out to disk. If so, the files don't exist yet when processFinal
tries to read them back, which would explain the "File does not exist"
error.
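
To illustrate what I mean by deferred execution, here is a toy model in
plain Java. It is only a sketch and not the Crunch API: LazyPipeline, its
write()/run()/read() methods, and the /tmp/filtered-true path are all made up
for illustration. It assumes writes are merely planned until run() is called,
which is my guess at what is happening in your job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of deferred execution. Hypothetical; NOT the Crunch API. */
final class LazyPipeline {
    // Stand-in for the file system: path -> records.
    private final Map<String, List<String>> disk = new HashMap<>();
    // Writes that have been planned but not yet executed.
    private final List<Runnable> planned = new ArrayList<>();

    /** Plans a write; nothing touches "disk" until run() is called. */
    void write(String path, List<String> records) {
        planned.add(() -> disk.put(path, new ArrayList<>(records)));
    }

    /** Executes every planned write, then clears the plan. */
    void run() {
        for (Runnable step : planned) {
            step.run();
        }
        planned.clear();
    }

    /** Reads from "disk", failing if the path was never materialized. */
    List<String> read(String path) {
        List<String> records = disk.get(path);
        if (records == null) {
            throw new IllegalStateException("File does not exist: " + path);
        }
        return records;
    }

    public static void main(String[] args) {
        LazyPipeline pipeline = new LazyPipeline();
        pipeline.write("/tmp/filtered-true", Arrays.asList("a", "b"));
        try {
            // The write is planned but has not run, so this read fails.
            pipeline.read("/tmp/filtered-true");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        pipeline.run(); // materialize the planned write
        System.out.println(pipeline.read("/tmp/filtered-true"));
    }
}
```

In this model the first read fails and the second succeeds, only because
run() was called in between. Whether the same applies to your job depends on
what filter() does internally.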


On Fri, Oct 23, 2015 at 9:41 AM, Rushi <hrishi.engineer@gmail.com> wrote:

> Does anyone have any idea why this might be happening? Is it possible that
> after 'done' is called, one of the paths completes processing first and the
> staging data gets cleared, causing the exception to be thrown for the other
> path?
>
> Thanks.
>
> On Wed, Oct 21, 2015 at 3:01 PM, Rushi <hrishi.engineer@gmail.com> wrote:
>
>> Thanks for replying.
>>
>> Actually, I'm not calling done in between the sections, but only at the
>> end. The getPipeline().run() call in the processIntermediate() method is
>> commented out (I was trying to see if that would help, but it didn't, so I
>> commented it out).
>>
>>
>> On Wed, Oct 21, 2015 at 2:05 PM, David Ortiz <dpo5003@gmail.com> wrote:
>>
>>> Don't call done in between sections where you use the PCollection and it
>>> should work.
>>>
>>> On Wed, Oct 21, 2015 at 2:57 PM Rushi <hrishi.engineer@gmail.com> wrote:
>>>
>>>> In Crunch, is it possible to reuse a PCollection multiple times for
>>>> different purposes in the same pipeline run? My pseudocode looks something
>>>> like the following, but I get the error "File does not exist:
>>>> /tmp/crunch-.." when I run it. If I comment out the second processing
>>>> path (the processFinal(paths.second()) line in the process() method), I do
>>>> not get the error and the pipeline executes successfully.
>>>>
>>>> // 1. Entry point
>>>> public int run() {
>>>>   process();
>>>>
>>>>   // done() runs any remaining planned work and cleans up temp files
>>>>   return getPipeline().done().succeeded() ? 0 : 1;
>>>> } // end of run()
>>>>
>>>>
>>>> // 2.
>>>> private PTable<String, String> process() {
>>>>
>>>>   Pair<Path, Path> paths = processIntermediate();
>>>>
>>>>   PTable<String, String> ret1 = processFinal(paths.first());
>>>>   PTable<String, String> ret2 = processFinal(paths.second());
>>>>
>>>>   return ret1.union(ret2);
>>>>
>>>> } // end of process()
>>>>
>>>>
>>>> // 3.
>>>> private Pair<Path, Path> processIntermediate() {
>>>>
>>>>   PTable<String, Integer> data = ...; // read data from wherever
>>>>
>>>>   // filter data from the input
>>>>   // filter() will write a PCollection to an AvroFileSourceTarget and
>>>>   // return its path, which will be used later to read the collection
>>>>   // back and do further processing.
>>>>   Path path1 = filter(data, fs, true);
>>>>   Path path2 = filter(data, fs, false);
>>>>
>>>>   // getPipeline().run();
>>>>
>>>>   return Pair.of(path1, path2);
>>>>
>>>> } // end of processIntermediate()
>>>>
>>>>
>>>> // 4.
>>>> private PTable<String, String> processFinal(Path path) {
>>>>
>>>>   PCollection<String> table = getPipeline().read(
>>>>       new AvroFileSource<>(path), records(Strings));
>>>>
>>>>   return table.parallelDo(...);
>>>>
>>>> } // end of processFinal()
>>>>
>>>>
>>>> I imagine I could probably use Oozie workflow actions to simplify the
>>>> processing but if this is just a matter of syntax/rearranging the code, I
>>>> would like to know it.
>>>>
>>>> *Thanks in advance!*
>>>>
>>>
>>
>


--047d7b6250a8839a920522c8ac9b--