From: David Ortiz
Date: Sun, 17 Jul 2016 02:29:13 +0000
Subject: Re: Processing many map only collections in single pipeline with spark
To: user@crunch.apache.org

Hmm. Just out of curiosity, what if you do Pipeline.read in place of readTextFile?
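
A rough sketch of that swap (illustrative only, not code from the thread; the wrapper class is hypothetical, and it assumes plain-text input read through From.textFile):

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.io.From;

    public class ReadViaSource {
      public static PCollection<String> readOne(Pipeline pipeline, String path) {
        // Pipeline.read takes an explicit Source instead of the readTextFile shortcut.
        return pipeline.read(From.textFile(path));
      }
    }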


On Sat, Jul 16, 2016, 10:08 PM Ben Juhn <benjijuhn@gmail.com> wrote:
Nope, it queues up the jobs in series there too.

On Jul 16, 2016, at 6:01 PM, David Ortiz <dpo5003@gmail.com> wrote:

*run in parallel


On Sat, Jul 16, 2016, 5:36 PM David Ortiz <dpo5003@gmail.com> wrote:

Just out of curiosity, if you use mrpipeline does it fun on parallel? If so, issue may be in spark since I believe crunch leaves it to spark to handle best method of execution.
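
For anyone wanting to run that comparison, only the pipeline construction changes; a sketch under the assumption of a YARN deployment (the class name, app name, and master string are illustrative, not taken from the thread):

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.impl.spark.SparkPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class PipelineFactory {
      public static Pipeline create(boolean useSpark, Configuration conf) {
        // The same downstream job code runs against either planner, which makes
        // the MRPipeline-vs-SparkPipeline experiment cheap to try.
        return useSpark
            ? new SparkPipeline("yarn-client", "multi-path-job")
            : new MRPipeline(PipelineFactory.class, conf);
      }
    }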


On Sat, Jul 16, 2016, 4:29 PM Ben Juhn <benjijuhn@gmail.com> wrote:
Hey David,

I have 100 active executors, each job typically only uses a few. It's running on yarn.

Thanks,
Ben

On Jul 16, 2016, at 12:53 PM, David Ortiz <dpo5003@gmail.com> wrote:

What are the cluster resources available vs what a single map uses?


On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <benjijuhn@gmail.com> wrote:
I enabled FAIR scheduling hoping that would help but only one job is showing up at a time.

Thanks,
Ben
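
One note on the FAIR scheduling mentioned above: spark.scheduler.mode=FAIR only shares resources between jobs that are submitted to the SparkContext concurrently (for example from separate driver threads); it does not overlap jobs that a single thread submits one after another, which is consistent with seeing only one job at a time. Enabling it typically looks something like the sketch below (the class and app name are illustrative, and the SparkPipeline constructor taking a JavaSparkContext is assumed here):

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.spark.SparkPipeline;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class FairSchedulerPipeline {
      public static Pipeline create() {
        // FAIR mode lets concurrent jobs share executors, but the driver still
        // has to submit those jobs concurrently for anything to overlap.
        SparkConf conf = new SparkConf()
            .setAppName("multi-path-job")
            .set("spark.scheduler.mode", "FAIR");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        return new SparkPipeline(jsc, "multi-path-job");
      }
    }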

On Jul 15, 2016, at 8:17 PM, Ben Juhn <benjijuhn@gmail.com> wrote:

Each input is of a different format, and the DoFn implementation handles them depending on instantiation parameters.

Thanks,
Ben
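
MyDoFn itself is not shown anywhere in the thread; a hypothetical shape for a path-parameterized DoFn like the one described above could be:

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    // Hypothetical reconstruction for illustration; the real MyDoFn is not in the thread.
    public class MyDoFn extends DoFn<String, String> {
      private final String path;

      public MyDoFn(String path) {
        // The source path tells the function which input format it is parsing.
        this.path = path;
      }

      @Override
      public void process(String line, Emitter<String> emitter) {
        // Format-specific parsing would go here, keyed off the constructor argument.
        emitter.emit(path + "\t" + line);
      }
    }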

On Jul 15, 2016, at 7:09 PM, Stephen Durfey wrote:

Instead of using readTextFile on the pipeline, try using the read method and use the TextFileSource, which can accept a collection of paths.

https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
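
A rough sketch of that suggestion (the helper class is illustrative, not from the thread), with the caveat that it folds every path into one PCollection, so the per-path constructor argument used by MyDoFn would need to be replaced by some other way of telling the inputs apart:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.io.text.TextFileSource;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.fs.Path;

    public class MultiPathRead {
      public static PCollection<String> readAll(Pipeline pipeline, List<String> paths) {
        List<Path> inputs = new ArrayList<Path>();
        for (String p : paths) {
          inputs.add(new Path(p));
        }
        // One source over all paths yields a single PCollection, so the planner
        // sees one read rather than one independent job per path.
        return pipeline.read(new TextFileSource<String>(inputs, Writables.strings()));
      }
    }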




On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <benjijuhn@gmail.com> wrote:

Hello,

I have a job configured the following way:
for (String path : paths) {
    PCollection<String> col = pipeline.readTextFile(path);
    col.parallelDo(new MyDoFn(path), Writables.strings()).write(To.textFile("out/" + path), Target.WriteMode.APPEND);
}
pipeline.done();
It results in one spark job for each path, and the jobs run in sequence even though there are no dependencies. Is it possible to have the jobs run in parallel?
Thanks,
Ben



