Date: Thu, 13 Oct 2016 14:33:52 -0500
From: sblackmon
To: dev@streams.incubator.apache.org
Cc: Matt Franklin, Ryan Ebanks
Subject: Re: [DISCUSS] What to do with streams-runtime-local and other streams-runtimes modules

Preferably in an akka implementation that is lock free, or as lock free as possible.

Flink has a local environment (based on akka) that I've found to work very well for integration testing of Flink data pipelines, and which I think could also be suitable for testing and running streams component pipelines.

A few nice things about Flink's architecture vis-a-vis Apache Streams:

* Sub-second spin-up and tear-down of the local environment
* Near-real-time propagation of data between stages (not micro-batch)
* Exactly-once data guarantees from many sources, including kafka and hdfs, via checkpointing
* Ability to create a snapshot (checkpoint) on demand and restart the pipeline, preserving state of
* Most data in flight is kept in binary blocks that spill to disk, so no out-of-heap crashes even when memory is constrained

On October 12, 2016 at 1:55:18 PM, Ryan Ebanks (ryanebanks@gmail.com) wrote:

It's been a long time since I contributed, but I would like to agree that the local runtime should be trashed/re-written. Preferably in an akka implementation that is lock free or as lock free as possible. The current one is not very good, and I say that knowing I wrote a lot of it.
I also think it should be designed for local testing only and not for production use. There are too many other frameworks to use to justify the work needed to fix/re-write it to a production standard.

On Wed, Oct 12, 2016 at 11:21 AM, sblackmon wrote:

> On October 11, 2016 at 3:31:41 PM, Matt Franklin (m.ben.franklin@gmail.com) wrote:
> On Tue, Sep 27, 2016 at 6:05 PM sblackmon wrote:
> >
> > All,
> >
> > Joey brought this up over the weekend and I think a discussion is overdue on the topic.
> >
> > Streams components were meant to be compatible with other runtime frameworks all along, and for the most part are implemented in a manner compatible with distributed execution, where coordination, message passing, and lifecycle are handled outside of streams libraries. By community standards, any component or component configuration object that doesn't serialize cleanly for relocation in a distributed framework is a bug.
> >
>
> Agreed, though this could be more explicit.
>
> Some modules contain a unit test that checks for serializability of components. Maybe we can find a way to systematize this such that every Provider, Processor, and Persister added to the code base gets run through a serializability check during mvn test. We could try to catch up by adding similar tests throughout the code base and -1 new submissions that don't include such a test, but that approach seems harder than doing something using Reflections.
>
> >
> > When the streams project got started in 2012, storm was the only TLP real-time data processing framework at apache, but now there are plenty of good choices, all of which are faster and better tested than our streams-runtime-local module.
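The serializability round-trip check discussed above can be sketched with nothing but java.io. This is a minimal illustration, not the project's actual test; the StubProcessor class here is a hypothetical stand-in for the real org.apache.streams component interfaces, which aren't reproduced in this thread:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializabilityCheck {

    // Hypothetical stand-in for a streams Provider/Processor/Persister.
    static class StubProcessor implements Serializable {
        private static final long serialVersionUID = 1L;
        String config = "example";
    }

    // Serialize an object to bytes, as a distributed framework would when
    // relocating a component to another worker.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    // Deserialize the bytes back into an object.
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        StubProcessor original = new StubProcessor();
        StubProcessor copy = (StubProcessor) deserialize(serialize(original));
        // By the community standard described above, a component that
        // fails this round trip would be a bug.
        if (!original.config.equals(copy.config)) {
            throw new AssertionError("round trip lost state");
        }
        System.out.println("serializable: " + copy.config);
    }
}
```

A Reflections-style scan could then apply this same round trip to every class implementing the component interfaces, rather than relying on hand-written per-module tests.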
> >
>
> > So, what should be the role of streams-runtime-local? Should we keep it at all? The tests take forever to run and my organization has stopped using it entirely. The best argument for keeping it is that it is useful when integration testing small pipelines, but perhaps we could just agree to use something else for that purpose?
> >
> I think having a local runtime for testing or small streams is valuable, but there is a ton of work that needs to go into the current runtime.
>
> Yeah, the magnitude of that effort is why it might be worth considering starting from scratch. We need a quality testing harness runtime at a minimum. local is suitable for that, barely.
>
> >
> > Do we want to keep the other runtime modules around and continue adding more? I've found that when embedding streams components in other frameworks (spark and flink most recently) I end up creating a handful of classes to help bind streams interfaces and instances within the udfs / functions / transforms / whatever that framework's atomic unit of computation is, and reusing them in all my pipelines.
> >
> I think this is valuable. A set of libraries that adapt a common programming model to various frameworks and thereby simplify stream development is inherently cool. Write once, run anywhere.
>
> It's a cool idea, but I've never successfully used it that way. Also, as soon as you bring a batch framework like pig, spark, or flink into your design, streams persisters quickly become irrelevant, because performance is usually better using the framework's preferred libraries.
> Streaming frameworks, not as much, but there's a trade-off to consider with every integration point, and I've found pretty much universally that the framework-maintained libraries tend to be faster.
>
> >
> > How about the StreamBuilder interface? Does anyone still believe we should support (and still want to work on) classes implementing StreamBuilder to build and run a pipeline comprised solely of streams components on other frameworks? Personally I prefer to write code using the framework APIs at the pipeline level, and embed individual streams components at the step level.
> >
> I think this could be valuable if done better. For instance, binding classes to steps in the stream pipeline, rather than instances. This would let the aforementioned adapter libraries configure components using the programming model declared by streams and set up pipelines in target systems.
>
> It's a cool idea, but I think down that path we'd wind up with pretty beefy runtimes, loss of clean separation between modules, and unwanted transitive dependencies. For example, an hdfs persist reader embedded in a spark pipeline should be interpreted as sc.readTextFile / readSequenceFile, or else the spark.* properties that determine read behavior won't be picked up. An elasticsearch persist writer embedded in a flink pipeline should be interpreted to use org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSink, or just maybe org.apache.flink.streaming.connectors.elasticsearch2.ElasticsearchSink, depending on what's available on the class path. Pretty soon each runtime becomes a crazy-hard-to-test monolith, because it has to be in order to properly optimize how it interprets each component.
> That's my fear anyway, and it's why I'm leaning more toward runtimes that don't have a StreamBuilder at all, and just provide configuration support and static helper methods.
>
> >
> > Any other thoughts on the topic?
> >
> > Steve
> >
> > - What should the focus be? If you look at the code, the project really provides 3 things: (1) a stream processing engine and integration with data persistence mechanisms, (2) a reference implementation of ActivityStreams, AS schemas, and tools for interlinking activity objects and events, and (3) a uniform API for integrating with social network APIs. I don't think that first thing is needed anymore. Just looking at Apache projects, NiFi, Apex + Apex Malhar, and to some extent Flume are further along here. StreamSets covers some of this too, and arguably Logstash also gets used for this sort of work. I.e., I think the project would be much stronger if it focused on (2) and (3) and marrying those up to other Apache projects that fit (1). Minimally, it needs to be de-entangled a bit.
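The "no StreamBuilder, just static helpers" shape described above could look roughly like the sketch below. Everything here is hypothetical (the thread doesn't pin down an API): the StreamsProcessor interface is a stand-in for the real streams component contracts, and a plain java.util.stream pipeline stands in for a host framework such as spark or flink.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RuntimeHelpers {

    // Hypothetical stand-in for a streams component contract.
    interface StreamsProcessor {
        List<String> process(String datum);
    }

    // Static helper: adapt a streams component into a plain Function that
    // the host framework can wrap in its own map/flatMap step, leaving
    // coordination, message passing, and lifecycle to that framework.
    static Function<String, List<String>> asFunction(StreamsProcessor processor) {
        return processor::process;
    }

    public static void main(String[] args) {
        StreamsProcessor upper = datum -> List.of(datum.toUpperCase());
        Function<String, List<String>> step = asFunction(upper);

        // Stand-in for a framework pipeline; a runtime module would supply
        // only this kind of glue, not a whole StreamBuilder implementation.
        List<String> out = Stream.of("a", "b")
                .flatMap(d -> step.apply(d).stream())
                .collect(Collectors.toList());
        System.out.println(out);
    }
}
```

The point of the helper style is that the pipeline itself is written in the host framework's own API, so framework-level configuration (spark.* properties, flink checkpointing, connector choice) stays visible and under the user's control.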