From: Danny Morgan <unluckyboy@hotmail.com>
To: user@crunch.apache.org
Subject: RE: Multiple Reduces in a Single Crunch Job
Date: Fri, 5 Dec 2014 00:23:19 +0000
Hi Josh,

Sorry, I mixed up pipelines; there is no S3 write in this case.

So you are correct: the intermediate Avro file that is the output of the SecondarySort is labeled "/tmp". I don't manually create this local file; the Crunch planner seems to insert that materialization phase. If you refer back to my original email, the error I get is:

"org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs:///tmp/crunch-1279941375/p1"

So the dot plan labels the file as "/tmp/crunch-*", but when the job runs it expects to find "hdfs:///tmp/crunch-*". Is this a labeling issue in the plan output, or might this be the bug?

-Danny


From: jwills@cloudera.com
Date: Thu, 4 Dec 2014 15:44:37 -0800
Subject: Re: Multiple Reduces in a Single Crunch Job
To: user@crunch.apache.org

Hey Danny,

Inlined.

On Thu, Dec 4, 2014 at 3:20 PM, Danny Morgan <unluckyboy@hotmail.com> wrote:

Hi Josh,

Thanks for taking the time to look into this.

I do get a PCollection<Object, String> and split it. I write the Avro objects as Parquet to HDFS, and I write the String collection out to s3n://. I have noticed that the s3n:// targets copy their files to the local filesystem's /tmp and then copy each file up to S3. This happens serially and is very slow; I'm not sure whether it's a Crunch issue or a general HDFS one.
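
(A minimal sketch of the split-and-write pattern described above, reusing the done collection and targets from the code at the bottom of the thread; the HDFS and s3n paths are illustrative:)

    // One Channels.split, two filesystems: Parquet to HDFS, text to S3 via s3n.
    // (uses org.apache.crunch.lib.Channels, org.apache.crunch.io.To,
    //  org.apache.crunch.io.parquet.AvroParquetFileTarget, org.apache.hadoop.fs.Path)
    Pair<PCollection<Danny>, PCollection<String>> splits = Channels.split(done);
    splits.first().write(new AvroParquetFileTarget(new Path("hdfs:///data/danny")),
        WriteMode.OVERWRITE);
    splits.second().write(To.textFile("s3n://my-bucket/danny-text"), WriteMode.OVERWRITE);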

I'm not following; I'm referring to the second_phase.pdf plan file, which has a bunch of Avro inputs that are being merged together and secondary sorted (some sort of sessionization, I assume), followed by a GBK/combineValues and then the write to Parquet. Where does the PCollection<Object, String> fit in? And is the S3 write part of the same Pipeline instance? I'm wondering whether the multiple FileSystems are confusing the planner with respect to where it should create the temp file.

Let me know if I can help debug further. As I mentioned, calling pipeline.cache() and pipeline.run() between the reduces did solve my problem, although I guess it is a hack.

BTW, Spotify's crunch-lib looks great; any integration plans?

I also really like it and would like to incorporate basically all of it; I will start a thread on dev@ about it and see if David is up for it.

-Danny


From: jwills@cloudera.com
Date: Thu, 4 Dec 2014 14:21:55 -0800
Subject: Re: Multiple Reduces in a Single Crunch Job
To: user@crunch.apache.org

Danny,

Spent a couple of hours today banging on this by hacking on some integration tests, but I couldn't replicate it. However, I just took a closer look at the plan you posted, and I noticed that all of the files you are writing out are prefixed with "hdfs:/" except for the /tmp/crunch-* file that Crunch is creating; is it possible that Crunch is creating the temp file locally on your client machine for some reason? I can't think of why that would happen off the top of my head, but if that is the problem, I'll at least be able to figure out where to look.
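
(One way to test the local-temp-file theory would be to pin Crunch's scratch directory to a fully qualified HDFS URI; a minimal sketch, assuming the crunch.tmp.dir configuration key read by MRPipeline -- treat the key name and the namenode address as assumptions:)

    // Hypothetical diagnostic: force Crunch's temp/scratch space onto HDFS explicitly
    // so it cannot be resolved against the local filesystem.
    // (uses org.apache.hadoop.conf.Configuration, org.apache.crunch.Pipeline,
    //  org.apache.crunch.impl.mr.MRPipeline)
    Configuration conf = new Configuration();
    conf.set("crunch.tmp.dir", "hdfs://namenode:8020/tmp");  // assumed key; defaults to /tmp
    Pipeline pipeline = new MRPipeline(MyApp.class, conf);   // MyApp is an illustrative class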

Josh


On Tue, Nov 25, 2014 at 6:30 PM, Danny Morgan <unluckyboy@hotmail.com> wrote:

No problem, Happy Thanksgiving!

Gobble Gobble...


From: jwills@cloudera.com
Date: Tue, 25 Nov 2014 18:23:14 -0800
Subject: Re: Multiple Reduces in a Single Crunch Job
To: user@crunch.apache.org

Very useful -- thank you. Will dig into it and report back, although I'm heading out for the holiday so it likely won't be until early next week.

J

On Tue, Nov 25, 2014 at 6:16 PM, Danny Morgan <unluckyboy@hotmail.com> wrote:

Having a single pipeline in the application didn't fix it. Sticking a pipeline.run() in the middle didn't matter either; the plan looks as if the planner is completely ignoring the second run() I added.

However, what DOES work is if I do:

collection = secondarySort()
pipeline.cache(collection)
pipeline.run()
newcollection = collection.groupByKey()

If I try adding the cache() without calling run() in between, it doesn't work. Hope that's enough info for you to fix the possible planner bug.
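
(A slightly fuller sketch of the workaround above, written against the PTable types from the code at the bottom of the thread; the exact cache API may vary by Crunch version, so treat the cache() call shape as an assumption:)

    // cache() + run() materializes the secondary-sort output before the planner
    // lays out the second reduce (the groupByKey/combineValues).
    // (uses org.apache.crunch.PTable, org.apache.crunch.Aggregators)
    PTable<Danny, Long> second = Danny.extractDannys(first);  // reduce #1: secondary sort
    second.cache();                                           // persist the intermediate output
    pipeline.run();                                           // execute reduce #1 now
    PTable<Danny, Long> third =
        second.groupByKey().combineValues(Aggregators.SUM_LONGS());  // reduce #2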

Thanks for the help Josh!

From: unluckyboy@hotmail.com
To: user@crunch.apache.org
Subject: RE: Multiple Reduces in a Single Crunch Job
Date: Wed, 26 Nov 2014 01:58:11 +0000

I tried doing a Sample() instead of an identity function, but that got fused into the reduce as well and didn't work.

The first thing I tried was sticking a pipeline.run() in between; I was surprised, but it didn't work either -- same error. I'll rerun that config now and try to get the dot files for the plan.

Not sure if this is affecting it, but in the same Crunch application I have a completely independent pipeline that runs before this one executes. I'll turn that off as well and see if it's causing the issue.


From: jwills@cloudera.com
Date: Tue, 25 Nov 2014 17:43:52 -0800
Subject: Re: Multiple Reduces in a Single Crunch Job
To: user@crunch.apache.org

Drat, I was hoping it was something simple. You could manually fix it by injecting a pipeline.run() call between the secondarySort and the groupByKey(), but of course we'd like to handle this situation correctly by default.

J

On Tue, Nov 25, 2014 at 5:39 PM, Danny Morgan <unluckyboy@hotmail.com> wrote:

I did a parallelDo with the IdentityFn on the output of the secondarySort; the IdentityFn was just fused into the reduce phase of the secondarySort, and I got the same error message.

I think you want me to somehow force a map phase in between the two reduces?
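
(A minimal sketch of the identity pass-through described above, assuming Crunch's IdentityFn and the PTable types from the code at the bottom of the thread; as noted, the planner fuses it into the preceding reduce, so it does not force a separate map phase:)

    // Identity parallelDo between the two reduces; Crunch fuses this into the
    // reduce phase of the secondary sort rather than starting a new job.
    // (uses org.apache.crunch.fn.IdentityFn, org.apache.crunch.Pair)
    PTable<Danny, Long> passthrough = second.parallelDo(
        IdentityFn.<Pair<Danny, Long>>getInstance(),
        second.getPTableType());
    PTable<Danny, Long> third =
        passthrough.groupByKey().combineValues(Aggregators.SUM_LONGS());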

-Danny


From: josh.wills@gmail.com
Date: Tue, 25 Nov 2014 17:23:29 -0800
Subject: Re: Multiple Reduces in a Single Crunch Job
To: user@crunch.apache.org

Oh, dumb question -- if you put a dummy function between the secondarySort and the groupByKey, like an IdentityFn or something, do things work again? That would help with diagnosing the problem.

On Tue, Nov 25, 2014 at 5:15 PM, Josh Wills <jwills@cloudera.com> wrote:

So if you're getting it quickly, it might be because the job isn't recognizing the dependency between the two separate phases of the job for some reason (e.g., it's not realizing that one job has to be run before the other one). That's an odd situation, but we have had bugs like that in the past; let me see if I can re-create the situation in an integration test. Which version of Crunch?

J

On Tue, Nov 25, 2014 at 4:40 PM, Danny Morgan <unluckyboy@hotmail.com> wrote:

No, that's definitely not it. I get this issue if I write to a single output as well.

If I remove the groupByKey().combineValues() line and just write out the output from the SecondarySort, it works. It seems to only complain about the temp path not existing when I have multiple reduce phases in the pipeline. Also, the error seems to happen immediately, during the setup or planning phase; I assume this because the YARN jobs get created but they don't do anything, and instead of FAILED the error message is "Application killed by user."

-Danny


From: jwills@cloudera.com
Date: Tue, 25 Nov 2014 16:30:58 -0800
Subject: Re: Multiple Reduces in a Single Crunch Job
To: user@crunch.apache.org

Ack, sorry -- it's this: https://issues.apache.org/jira/browse/CRUNCH-481

On Tue, Nov 25, 2014 at 4:24 PM, Danny Morgan <unluckyboy@hotmail.com> wrote:

Hello again Josh,

The link to the Jira issue you sent out seems to be cut off; could you please resend it?

I deleted the line where I write the collection to a text file and retried, but it didn't work either. I also tried writing the collection out as Avro instead of Parquet, but got the same error.

Here's the rest of the stack trace:

org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs:///tmp/crunch-2008950085/p1
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:217)
    at org.apache.crunch.impl.mr.run.CrunchInputFormat.getSplits(CrunchInputFormat.java:65)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:491)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:508)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:392)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1528)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265)
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:340)
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:277)
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(CrunchJobControl.java:316)
    at org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:113)
    at org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:55)
    at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:84)
    at java.lang.Thread.run(Thread.java:744)

Thanks Josh!


From: jwills@cloudera.com
Date: Tue, 25 Nov 2014 16:10:33 -0800
Subject: Re: Multiple Reduces in a Single Crunch Job
To: user@crunch.apache.org

Hey Danny,

I'm wondering if this is caused by https://issues.apache.org/jira/browse/CRUNCH-481 -- I think we use different output committers for text files vs. Parquet files, so at least one of the outputs won't be written properly -- does that make sense?

Josh

On Tue, Nov 25, 2014 at 4:07 PM, Danny Morgan <unluckyboy@hotmail.com> wrote:

Hi Crunchers,

I've attached a PDF of what my plan looks like. I've run into this problem before: whenever I have multiple reduce steps chained together in a single pipeline, I always get the same error.

In the case of the attached PDF, the error is "org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs:///tmp/crunch-1279941375/p1"

That's the temp directory the Crunch planner set up for the first reduce phase.

Can I run multiple chained reduces within the same pipeline? Do I have to manually write out the output from the first reduce?

Here's what the code looks like:

    // Simple mapper
    PTable<String, Pair<Long, Log>> first = Danny.filterForDanny(logs);
    // Secondary sort happens here
    PTable<Danny, Long> second = Danny.extractDannys(first);
    // Regular group by
    PTable<Danny, Long> third = second.groupByKey().combineValues(Aggregators.SUM_LONGS());
    // Simple function that populates some fields in the Danny object with the aggregate results
    PCollection<Pair<Danny, String>> done = Danny.finalize(third);
    Pair<PCollection<Danny>, PCollection<String>> splits = Channels.split(done);
    splits.second().write(To.textFile(mypath), WriteMode.OVERWRITE);
    Target pq_danny = new AvroParquetFileTarget(pqPath);
    splits.first().write(pq_danny, WriteMode.OVERWRITE);
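
(For context on where the first reduce comes from, a hypothetical sketch of what the secondary-sort step inside Danny.extractDannys might look like; it assumes Crunch's SecondarySort.sortAndApply with a PTableType overload and Avro-backed types, and ExtractDannysFn is an illustrative name, not something from the thread:)

    // Hypothetical shape of the secondary-sort step (the first reduce in the plan).
    // ExtractDannysFn would be a DoFn<Pair<String, Iterable<Pair<Long, Log>>>, Pair<Danny, Long>>.
    // (uses org.apache.crunch.lib.SecondarySort, org.apache.crunch.types.avro.Avros)
    PTable<Danny, Long> second = SecondarySort.sortAndApply(
        first,                              // PTable<String, Pair<Long, Log>>
        new ExtractDannysFn(),
        Avros.tableOf(Avros.records(Danny.class), Avros.longs()));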

Thanks!

-Danny

-- 
Director of Data Science
Cloudera
Twitter: @josh_wills


