streams-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From sblackmon <>
Subject Re: [DISCUSS] What to do with streams-runtime-local and other streams-runtimes modules
Date Thu, 13 Oct 2016 19:33:52 GMT
Preferably in a akka implementation that is lock free or as lock free as possible. 

Flink has a local environment (based on akka) that I’ve found to work very well for integration
testing of flink data pipelines, which I think could also be suitable for testing and running
streams component pipelines.

A few nice things about Flink’s architecture vis-a-vis Apache Streams:
* Sub-second spin-up and tear-down of local environment
* Near-real-time propagation of data between stages (not micro-batch)
* Exactly-once data guarantees from many sources, including kafka and hdfs, via checkpointing
* Ability to create a snapshot (checkpoint) on-demand and restart the pipeline, preserving
state of 
* Most data in-flight is kept in binary blocks that spill to disk, so no out of heap crashes
even when memory is constrained.

On October 12, 2016 at 1:55:18 PM, Ryan Ebanks ( wrote:
Its been a long time since I contributed, but I would like to agree that  
local runtime should be trashed/re-written. Preferably in a akka  
implementation that is lock free or as lock free as possible. The current  
one is not very good, and I say that knowing I wrote a lot of it. I also  
think it should be designed for local testing only and not for production  
use. There are too many other frameworks to use to justify the work needed  
to fix/re-write it to a production standard.  

On Wed, Oct 12, 2016 at 11:21 AM, sblackmon <> wrote:  

> On October 11, 2016 at 3:31:41 PM, Matt Franklin (  
> wrote:  
> On Tue, Sep 27, 2016 at 6:05 PM sblackmon <> wrote:  
> > All,  
> >  
> >  
> >  
> > Joey brought this up over the weekend and I think a discussion is overdue  
> > on the topic.  
> >  
> >  
> >  
> > Streams components were meant to be compatible with other runtime  
> > frameworks all along, and for the most part are implemented in a manner  
> > compatible with distributed execution where coordination, message  
> passing,  
> > and lifecycle and handled outside of streams libraries. By community  
> > standards any component or component configuration object that doesn't  
> > cleanly serializable for relocation in a distributed framework is a bug.  
> >  
> Agreed, though this could be more explicit.  
> Some modules contain a unit test that checks for serializability of  
> components. Maybe we can find a way to systematize this such that every  
> Provider, Processor, and Persister added to the code base gets run through  
> a serializability check during mvn test. We could try to catch up by  
> adding similar tests throughout the code base and -1 new submissions that  
> don’t include such a test, but that approach seems harder than doing  
> something using Reflections.  
> >  
> >  
> >  
> > When the streams project got started in 2012 storm was the only TLP  
> > real-time data processing framework at apache, but now there are plenty  
> of  
> > good choices all of which are faster and better tested than our  
> > streams-runtime-local module.  
> >  
> >  
> >  
> > So, what should be the role of streams-runtime-local? Should we keep it  
> > at all? The tests take forever to run and my organization has stopped  
> > using it entirely. The best argument for keeping it is that it is useful  
> > when integration testing small pipelines, but perhaps we could just agree  
> > to use something else for that purpose?  
> >  
> >  
> I think having a local runtime for testing or small streams is valuable,  
> but there is a ton of work that needs to go into the current runtime.  
> Yeah, the magnitude of that effort is why it might be worth considering  
> starting from scratch. We need a quality testing harness runtime at a  
> minimum. local is suitable for that, barely.  
> >  
> >  
> > Do we want to keep the other runtime modules around and continue adding  
> > more? I’ve found that when embedding streams components in other  
> > frameworks (spark and flink most recently) I end up creating a handful of  
> > classes to help bind streams interfaces and instances within the pdfs /  
> > functions / transforms / whatever are that framework atomic unit of  
> > computation and reusing them in all my pipelines.  
> >  
> >  
> I think this is valuable. A set of libraries that adapt a common  
> programming model to various frameworks that simply stream development is  
> inherently cool. Write once, run anywhere.  
> It’s a cool idea, but I’ve never successfully used it that way. Also as  
> soon as you bring a batch framework like pig, spark, or flink into your  
> design, streams persisters quickly become irrelevant because performance is  
> usually better using the framework preferred libraries. Streaming  
> frameworks not as much but there’s a trade-off to consider with every  
> integration point and I’ve found pretty much universally that the  
> framework-maintained libraries tend to be faster.  
> >  
> >  
> > How about the StreamBuilder interface? Does anyone still believe we  
> > should support (and still want to work on) classes  
> > implementing StreamBuilder to build and running a pipeline comprised  
> solely  
> > of streams components on other frameworks? Personally I prefer to write  
> > code using the framework APIs at the pipeline level, and embed individual  
> > streams components at the step level.  
> >  
> >  
> I think this could be valuable if done better. For instance, binding  
> classes to steps in the stream pipeline, rather than instances. This would  
> let the aforementioned adapter libraries configure components using the  
> programming model declared by streams and setup pipelines in target  
> systems.  
> It’s a cool idea, but i think down that path we’d wind up with pretty  
> beefy runtimes, loss of clean separation between modules, and unwanted  
> transitive dependencies. For example, an hdfs persist reader embedded in a  
> spark pipeline should be interpreted as sc.readTextFile / readSequenceFile,  
> or else spark.* properties that determine read behavior won’t be picked  
> up. An elastic search persist writer embedded in a flink pipeline should  
> be interpreted to use org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSink.
> Or just maybe org.apache.flink.streaming.connectors.elasticsearch2.ElasticsearchSink
> on what’s available on the class path. Pretty soon each runtime becomes a  
> crazy hard to test monolith because it has to be to properly optimize how  
> it interprets each component. That’s my fear anyway and it’s why I’m  
> leaning more toward runtimes that don’t have a StreamBuilder at all, and  
> just provide configuration support and static helper methods.  
> >  
> >  
> > Any other thoughts on the topic?  
> >  
> >  
> >  
> > Steve  
> >  
> >  
> >  
> > - What should the focus be? If you look at the code, the project really  
> > provides 3 things: (1) a stream processing engine and integration with  
> data  
> > persistence mechanisms, (2) a reference implementation of  
> ActivityStreams,  
> > AS schemas, and tools for interlinking activity objects and events, and  
> (3)  
> > a uniform API for integrating with social network APIs. I don't think  
> that  
> > first thing is needed anymore. Just looking at Apache projects, NiFi,  
> Apex  
> > + Apex Malhar, and to some extent Flume are further along here. Stream  
> Sets  
> > covers some of this too, and arguably Logstash also gets used for this  
> sort  
> > of work. I.e., I think the project would be much stronger if it focused  
> on  
> > (2) and (3) and marrying those up to other Apache projects that fit (1).  
> > Minimally, it needs to be de-entangled a bit.  

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message