From: Jesse Anderson
Date: Mon, 23 May 2016 19:59:38 +0000
Subject: Re: Force pipe executions to run on same node
To: "user@beam.incubator.apache.org"
Reply-To: user@beam.incubator.apache.org
Mailing-List: contact user-help@beam.incubator.apache.org; run by ezmlm
References: <574093E3.4080209@nanthrax.net> <574309DE.8080206@nanthrax.net>
archived-at: Mon, 23 May 2016 19:59:53 -0000

Benjamin,

Sorry, the successes and failures are a bit too nuanced for an email.

A quick check on average CAD files says they're around 1 MB. That'd be a
poor use of HDFS.

Thanks,

Jesse

On Mon, May 23, 2016 at 11:08 AM Stadin, Benjamin <
Benjamin.Stadin@heidelberg-mobil.com> wrote:

> Hi Jesse,
>
> Yes, this is what I'm looking for. I want to deploy and run the same code,
> mostly written in Python as well as C++, on different nodes. I also want to
> benefit from the job distribution and job monitoring / administration
> capabilities. I only need parallelization to a minor degree later.
>
> Though I'm hesitant to use HDFS, or any other distributed file system.
> Since I process the data only on one node, it will probably be a big
> disadvantage for this data to be distributed to other nodes as well via
> HDFS.
>
> Could you maybe share some info about the successful implementations and
> configurations of such a distributed job engine?
>
> Thanks
> Ben
>
> From: Jesse Anderson <jesse@smokinghand.com>
> Reply-To: "user@beam.incubator.apache.org" <
> user@beam.incubator.apache.org>
> Date: Monday, 23 May 2016 at 19:22
> To: "user@beam.incubator.apache.org" <user@beam.incubator.apache.org>
> Subject: Re: Force pipe executions to run on same node
>
> Benjamin,
>
> I've had a few students using Big Data frameworks as a distributed job
> engine. They work with varying degrees of success.
>
> With Beam, your success will really depend on the runner, as JB said. If I
> understand your use case correctly, if you were using Hadoop MapReduce,
> you'd be using a map-only job. Beam would give you the ability to run the
> same code on several different execution engines. If that isn't your goal,
> you might look elsewhere.
>
> Thanks,
>
> Jesse
>
> On Mon, May 23, 2016 at 6:47 AM Jean-Baptiste Onofré <jb@nanthrax.net>
> wrote:
>
>> Hi Benjamin,
>>
>> Your data processing doesn't seem to be fully big data oriented and
>> distributed.
>>
>> Maybe Apache Camel is more appropriate for such a scenario. You can always
>> delegate part of the data processing to Beam from Camel (using a Kafka
>> topic, for instance).
>>
>> Regards
>> JB
>>
>> On 05/22/2016 11:01 PM, Stadin, Benjamin wrote:
>> > Hi JB,
>> >
>> > None so far. I'm still thinking about how to achieve what I want to do,
>> > and whether Beam makes sense for my usage scenario.
>> >
>> > I'm mostly interested in just orchestrating tasks across individual
>> > machines and service endpoints, depending on their workload. My
>> > application is not so much about Big Data and parallelism, but local
>> > data processing and local parallelization.
>> >
>> > An example scenario:
>> > - A user uploads a set of CAD files
>> > - data from the CAD files is extracted in parallel
>> > - a whole bunch of native tools operate on this extracted data set in
>> > their own pipe. Due to the amount of data generated and consumed, it
>> > doesn't make sense at all to distribute these tasks to other machines.
>> > It's very IO bound.
>> > - For the same reason, it doesn't make sense to distribute data using
>> > RDD. It's rather favorable to do only some tasks (such as CAD data
>> > extraction) in parallel, and otherwise run the other data tasks as a
>> > group on a single node, in order to avoid IO bottlenecks.
>> >
>> > So I don't have typical Big Data processing in mind. What I'm looking
>> > for is rather an integrated environment that provides only some kind of
>> > parallel task execution, task management and administration, as well
>> > as a message bus and event system.
>> >
>> > Is Beam a choice for such a rather non-Big-Data scenario?
>> >
>> > Regards,
>> > Ben
>> >
>> >
>> > On 21 May 2016 at 18:59, "Jean-Baptiste Onofré" <jb@nanthrax.net>
>> > wrote:
>> >
>> >> Hi Ben,
>> >>
>> >> it's not SDK related, it depends more on the runner.
>> >>
>> >> What runner are you using?
>> >>
>> >> Regards
>> >> JB
>> >>
>> >> On 05/21/2016 04:22 PM, Stadin, Benjamin wrote:
>> >>> Hi,
>> >>>
>> >>> I need to control beam pipes/filters so that pipe executions that
>> >>> match certain criteria are executed on the same node.
>> >>>
>> >>> In Spring XD this can be controlled by defining groups
>> >>> (http://docs.spring.io/spring-xd/docs/1.2.0.RELEASE/reference/html/#deployment)
>> >>> and then specifying deployment criteria to match this group.
>> >>>
>> >>> Is this possible with Beam?
>> >>>
>> >>> Best
>> >>> Ben
>> >>
>> >> --
>> >> Jean-Baptiste Onofré
>> >> jbonofre@apache.org
>> >> http://blog.nanthrax.net
>> >> Talend - http://www.talend.com
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>