Mailing-List: contact user-help@beam.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@beam.incubator.apache.org
MIME-Version: 1.0
References: <D3663BC3.37523%benjamin.stadin@heidelberg-mobil.com>
 <574093E3.4080209@nanthrax.net> <D367B387.3756A%benjamin.stadin@heidelberg-mobil.com>
 <574309DE.8080206@nanthrax.net> <CAOdmQRhmHZnuizMtEYwTHJj84Db8Y0mXiN5FSQnjhR9J2Av7yA@mail.gmail.com>
 <D3690DA1.37757%benjamin.stadin@heidelberg-mobil.com>
In-Reply-To: <D3690DA1.37757%benjamin.stadin@heidelberg-mobil.com>
From: Jesse Anderson <jesse@smokinghand.com>
Date: Mon, 23 May 2016 19:59:38 +0000
Message-ID: <CAOdmQRgOyXa3ZW-+LAkgR5D9AVSPBzMLMCfGbXinr-jTWi0PoA@mail.gmail.com>
Subject: Re: Force pipe executions to run on same node
To: "user@beam.incubator.apache.org" <user@beam.incubator.apache.org>
Content-Type: multipart/alternative; boundary=94eb2c09d560ed1b02053387e162
archived-at: Mon, 23 May 2016 19:59:53 -0000

--94eb2c09d560ed1b02053387e162
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Benjamin,

Sorry, the successes and failures are a bit too nuanced for an email.

A quick check on average CAD files says they're around 1 MB. That'd be a
poor use of HDFS.

Thanks,

Jesse

On Mon, May 23, 2016 at 11:08 AM Stadin, Benjamin <
Benjamin.Stadin@heidelberg-mobil.com> wrote:

> Hi Jesse,
>
> Yes, this is what I’m looking for. I want to deploy and run the same code,
> mostly written in Python as well as C++, on different nodes. I also want to
> benefit from the job distribution and job monitoring / administration
> capabilities. I only need parallelization to a minor degree later.
>
> Though I’m hesitant to use HDFS, or any other distributed file system.
> Since I process the data only on one node, it will probably be a big
> disadvantage for this data to be distributed to other nodes as well via
> HDFS.
>
> Could you maybe share some info about the successful implementations and
> configurations of such a distributed job engine?
>
> Thanks
> Ben
>
> From: Jesse Anderson <jesse@smokinghand.com>
> Reply-To: "user@beam.incubator.apache.org" <
> user@beam.incubator.apache.org>
> Date: Monday, 23 May 2016 at 19:22
> To: "user@beam.incubator.apache.org" <user@beam.incubator.apache.org>
> Subject: Re: Force pipe executions to run on same node
>
> Benjamin,
>
> I've had a few students using Big Data frameworks as a distributed job
> engine. They work with varying degrees of success.
>
> With Beam, your success will really depend on the runner as JB said. If I
> understand your use case correctly, if you were using Hadoop MapReduce,
> you'd be using a map-only job. Beam would give you the ability to run the
> same code on several different execution engines. If that isn't your goal,
> you might look elsewhere.
>
> Thanks,
>
> Jesse
>
> On Mon, May 23, 2016 at 6:47 AM Jean-Baptiste Onofré <jb@nanthrax.net>
> wrote:
>
>> Hi Benjamin,
>>
>> Your data processing doesn't seem to be fully big data oriented and
>> distributed.
>>
>> Maybe Apache Camel is more appropriate for such a scenario. You can always
>> delegate part of the data processing to Beam from Camel (using a Kafka
>> topic, for instance).
>>
>> Regards
>> JB
>>
>> On 05/22/2016 11:01 PM, Stadin, Benjamin wrote:
>> > Hi JB,
>> >
>> > None so far. I'm still thinking about how to achieve what I want to do,
>> > and whether Beam makes sense for my usage scenario.
>> >
>> > I'm mostly interested in just orchestrating tasks to individual machines
>> > and service endpoints, depending on their workload. My application is not
>> > so much about Big Data and parallelism, but local data processing and
>> > local parallelization.
>> >
>> > An example scenario:
>> > - A user uploads a set of CAD files
>> > - data from CAD files are extracted in parallel
>> > - a whole bunch of native tools operate on this extracted data set in
>> > their own pipe. Due to the amount of data generated and consumed, it
>> > doesn't make sense at all to distribute these tasks to other machines.
>> > It's very IO bound.
>> > - For the same reason, it doesn't make sense to distribute data using RDD.
>> > It's rather favorable to do only some tasks (such as CAD data extraction)
>> > in parallel, otherwise run other data tasks as a group on a single node,
>> > in order to avoid IO bottlenecks.
>> >
>> > So I don't have typical Big Data processing in mind. What I'm looking
>> > for is rather an integrated environment to provide only some kind of
>> > parallel task execution, and task management and administration, as well
>> > as a message bus and event system.
>> >
>> > Is Beam a choice for such a rather non-Big-Data scenario?
>> >
>> > Regards,
>> > Ben
>> >
>> >
>> > On 21.05.16, 18:59, "Jean-Baptiste Onofré" <jb@nanthrax.net> wrote:
>> >
>> >> Hi Ben,
>> >>
>> >> it's not SDK related, it depends more on the runner.
>> >>
>> >> What runner are you using?
>> >>
>> >> Regards
>> >> JB
>> >>
>> >> On 05/21/2016 04:22 PM, Stadin, Benjamin wrote:
>> >>> Hi,
>> >>>
>> >>> I need to control Beam pipes/filters so that pipe executions that match
>> >>> certain criteria are executed on the same node.
>> >>>
>> >>> In Spring XD this can be controlled by defining groups
>> >>> (http://docs.spring.io/spring-xd/docs/1.2.0.RELEASE/reference/html/#deployment)
>> >>> and then specifying deployment criteria to match this group.
>> >>>
>> >>> Is this possible with Beam?
>> >>>
>> >>> Best
>> >>> Ben
>> >>
>> >> --
>> >> Jean-Baptiste Onofré
>> >> jbonofre@apache.org
>> >> http://blog.nanthrax.net
>> >> Talend - http://www.talend.com
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>

--94eb2c09d560ed1b02053387e162--