Subject: Re: Processing many map only collections in single pipeline with spark
From: Josh Wills
To: user@crunch.apache.org
Date: Sat, 16 Jul 2016 20:31:08 -0700

The TL;DR is that Spark doesn't really have a proper multiple-outputs model a la Crunch/MR: jobs aren't kicked off until you do some sort of write, and as soon as you do a write, Spark myopically executes all of the code that needs to happen for that write to be completed. You need to be fairly clever about sequencing your writes and doing intermediate caching to make sure your pipeline executes efficiently.

On the Crunch/MR side, I'm a little surprised that we're only executing the map-only jobs one at a time; I'm assuming you're not mucking with the crunch.max.running.jobs parameter in some way? (See: https://crunch.apache.org/user-guide.html#mrpipeline)

J

On Sat, Jul 16, 2016 at 7:29 PM, David Ortiz wrote:

> Hmm. Just out of curiosity, what if you do Pipeline.read in place of
> readTextFile?
>
> On Sat, Jul 16, 2016, 10:08 PM Ben Juhn wrote:
>
>> Nope, it queues up the jobs in series there too.
>>
>> On Sat, Jul 16, 2016, 5:36 PM David Ortiz wrote:
>>
>>> Just out of curiosity, if you use MRPipeline does it run in parallel?
>>> If so, the issue may be in Spark, since I believe Crunch leaves it to
>>> Spark to handle the best method of execution.
>>>
>>> On Sat, Jul 16, 2016, 4:29 PM Ben Juhn wrote:
>>>
>>>> Hey David,
>>>>
>>>> I have 100 active executors; each job typically only uses a few. It's
>>>> running on YARN.
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>> On Jul 16, 2016, at 12:53 PM, David Ortiz wrote:
>>>>
>>>> What are the cluster resources available vs. what a single map uses?
>>>>
>>>> On Sat, Jul 16, 2016, 3:04 PM Ben Juhn wrote:
>>>>
>>>>> I enabled FAIR scheduling hoping that would help, but only one job is
>>>>> showing up at a time.
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>> On Jul 15, 2016, at 8:17 PM, Ben Juhn wrote:
>>>>>
>>>>> Each input is of a different format, and the DoFn implementation
>>>>> handles them depending on instantiation parameters.
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>> On Jul 15, 2016, at 7:09 PM, Stephen Durfey wrote:
>>>>>
>>>>> Instead of using readTextFile on the pipeline, try using the read
>>>>> method and use the TextFileSource, which can accept a collection of
>>>>> paths.
>>>>>
>>>>> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
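A minimal, untested sketch of Stephen's suggestion, assuming the List<Path> constructor in the TextFileSource file linked above ("paths" and "pipeline" are the same variables as in Ben's original job below):

import java.util.ArrayList;
import java.util.List;

import org.apache.crunch.PCollection;
import org.apache.crunch.io.text.TextFileSource;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.fs.Path;

// Gather every input into a single source so the planner sees one
// read instead of one independent job per path.
List<Path> inputs = new ArrayList<Path>();
for (String path : paths) {
    inputs.add(new Path(path));
}
PCollection<String> lines = pipeline.read(
    new TextFileSource<String>(inputs, Writables.strings()));

Note that this collapses all of the inputs into one PCollection, so it only helps if the per-path constructor parameters of MyDoFn can be recovered some other way; as Ben notes above, each input is of a different format.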
>>>>> On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a job configured the following way:
>>>>>>
>>>>>> for (String path : paths) {
>>>>>>     PCollection<String> col = pipeline.readTextFile(path);
>>>>>>     col.parallelDo(new MyDoFn(path), Writables.strings())
>>>>>>        .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
>>>>>> }
>>>>>> pipeline.done();
>>>>>>
>>>>>> It results in one Spark job for each path, and the jobs run in sequence
>>>>>> even though there are no dependencies. Is it possible to have the jobs
>>>>>> run in parallel?
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
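To make the intermediate-caching advice at the top of the thread concrete: under the Spark pipeline, each output becomes a Spark action when the pipeline runs, and everything upstream of an action is recomputed for it unless it was cached. An untested sketch ("raw" is an existing PCollection<String>; ParseFn and ErrorsOnlyFn are hypothetical DoFn/FilterFn implementations):

import org.apache.crunch.PCollection;
import org.apache.crunch.Target;
import org.apache.crunch.io.To;
import org.apache.crunch.types.writable.Writables;

// One upstream collection feeding two outputs.
PCollection<String> parsed =
    raw.parallelDo(new ParseFn(), Writables.strings());
parsed.cache(); // materialize once so the second write can reuse it

parsed.write(To.textFile("out/parsed"), Target.WriteMode.OVERWRITE);
parsed.filter(new ErrorsOnlyFn())
      .write(To.textFile("out/errors"), Target.WriteMode.OVERWRITE);

Without the cache() call, the work inside ParseFn would run once per output when the two writes execute.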
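For the crunch.max.running.jobs parameter Josh references: on an MRPipeline it is an ordinary Hadoop Configuration setting, so independent map-only jobs can be allowed to run concurrently. A sketch, assuming the key name documented in the user guide (MyJobDriver is a hypothetical driver class):

import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

// Raise the cap on concurrently running MapReduce jobs; the planner
// still decides which jobs are independent enough to run together.
Configuration conf = new Configuration();
conf.setInt("crunch.max.running.jobs", 10);
Pipeline pipeline = new MRPipeline(MyJobDriver.class, conf);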