streams-dev mailing list archives

From sblackmon <>
Subject Re: [DISCUSS] What to do with streams-runtime-local and other streams-runtimes modules
Date Wed, 12 Oct 2016 16:21:26 GMT
On October 11, 2016 at 3:31:41 PM, Matt Franklin wrote:
On Tue, Sep 27, 2016 at 6:05 PM sblackmon wrote:  

> All,  
> Joey brought this up over the weekend and I think a discussion is overdue  
> on the topic.  
> Streams components were meant to be compatible with other runtime  
> frameworks all along, and for the most part are implemented in a manner  
> compatible with distributed execution where coordination, message passing,  
> and lifecycle are handled outside of streams libraries. By community  
> standards any component or component configuration object that isn't  
> cleanly serializable for relocation in a distributed framework is a bug.  

Agreed, though this could be more explicit.  

Some modules contain a unit test that checks for serializability of components.  Maybe we
can find a way to systematize this such that every Provider, Processor, and Persister added
to the code base gets run through a serializability check during mvn test.  We could try
to catch up by adding similar tests throughout the code base and -1 new submissions that don’t
include such a test, but that approach seems harder than doing something using Reflections.
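
A rough sketch of what such a check might look like, using only the JDK. It takes an explicit list of component classes, instantiates each through its no-arg constructor, and reports the ones that fail a Java serialization round trip. The class names GoodProcessor and BadProcessor are hypothetical stand-ins, not real streams components, and a real version would presumably discover the Provider, Processor, and Persister classes by classpath scanning instead of a hand-written list.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

final class SerializabilityCheck {

    // Hypothetical stand-in for a component whose state serializes cleanly.
    static class GoodProcessor implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String name = "good";
    }

    // Hypothetical stand-in for a component holding a non-serializable field.
    static class BadProcessor implements Serializable {
        private static final long serialVersionUID = 1L;
        private final Object client = new Object(); // java.lang.Object is not Serializable
    }

    /** Returns true if the instance survives a Java serialization round trip. */
    static boolean roundTrips(Object instance) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(instance);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject() != null;
            }
        } catch (IOException | ClassNotFoundException e) {
            return false;
        }
    }

    /** Instantiates each class via its no-arg constructor; returns names of those that fail. */
    static List<String> failures(List<Class<?>> componentClasses) {
        List<String> failed = new ArrayList<>();
        for (Class<?> type : componentClasses) {
            try {
                Object instance = type.getDeclaredConstructor().newInstance();
                if (!roundTrips(instance)) {
                    failed.add(type.getSimpleName());
                }
            } catch (ReflectiveOperationException e) {
                // No usable no-arg constructor counts as a failure in this sketch.
                failed.add(type.getSimpleName());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Class<?>> components = new ArrayList<>();
        components.add(GoodProcessor.class);
        components.add(BadProcessor.class);
        System.out.println(failures(components));
    }
}
```

A single test per module could then fail the build whenever the returned list is non-empty, which avoids asking every contributor to write the same test by hand.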

> When the streams project got started in 2012 storm was the only TLP  
> real-time data processing framework at apache, but now there are plenty of  
> good choices all of which are faster and better tested than our  
> streams-runtime-local module.  

> So, what should be the role of streams-runtime-local? Should we keep it  
> at all? The tests take forever to run and my organization has stopped  
> using it entirely. The best argument for keeping it is that it is useful  
> when integration testing small pipelines, but perhaps we could just agree  
> to use something else for that purpose?  
I think having a local runtime for testing or small streams is valuable,  
but there is a ton of work that needs to go into the current runtime.  

Yeah, the magnitude of that effort is why it might be worth considering starting from scratch.
 At a minimum we need a quality test-harness runtime. streams-runtime-local is suitable for that, but only barely.

> Do we want to keep the other runtime modules around and continue adding  
> more? I’ve found that when embedding streams components in other  
> frameworks (spark and flink most recently) I end up creating a handful of  
> classes to help bind streams interfaces and instances within the udfs /  
> functions / transforms / whatever that framework’s atomic unit of  
> computation is, and reusing them in all my pipelines.  
I think this is valuable. A set of libraries that adapt a common  
programming model to various frameworks that simplify stream development is  
inherently cool. Write once, run anywhere.  

It’s a cool idea, but I’ve never successfully used it that way.  Also, as soon as you
bring a batch framework like pig, spark, or flink into your design, streams persisters quickly
become irrelevant because performance is usually better using the framework’s preferred libraries.
 That is less true of streaming frameworks, but there’s a trade-off to consider with every integration
point, and I’ve found pretty much universally that the framework-maintained libraries tend
to be faster.  

> How about the StreamBuilder interface? Does anyone still believe we  
> should support (and still want to work on) classes  
> implementing StreamBuilder to build and run a pipeline comprised solely  
> of streams components on other frameworks? Personally I prefer to write  
> code using the framework APIs at the pipeline level, and embed individual  
> streams components at the step level.  
I think this could be valuable if done better. For instance, binding  
classes to steps in the stream pipeline, rather than instances. This would  
let the aforementioned adapter libraries configure components using the  
programming model declared by streams and setup pipelines in target  

It’s a cool idea, but I think down that path we’d wind up with pretty beefy runtimes,
loss of clean separation between modules, and unwanted transitive dependencies.  For example,
an hdfs persist reader embedded in a spark pipeline should be interpreted as sc.textFile
/ sc.sequenceFile, or else spark.* properties that determine read behavior won’t be picked
up.  An Elasticsearch persist writer embedded in a flink pipeline should be interpreted
to use org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSink, or maybe
org.apache.flink.streaming.connectors.elasticsearch2.ElasticsearchSink, depending
on what’s available on the class path. Pretty soon each runtime becomes a crazy hard-to-test
monolith, because it has to be one to properly optimize how it interprets each component.
 That’s my fear anyway, and it’s why I’m leaning more toward runtimes that don’t have
a StreamBuilder at all, and just provide configuration support and static helper methods.
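
To make the "static helper methods" idea a bit more concrete, here is a minimal sketch. The Processor interface below is a simplified, hypothetical stand-in for the streams processor contract (prepare / process / cleanUp), WordSplitter is a made-up component, and java.util.stream plays the role of the host framework. None of these names come from an existing runtime module; the point is only that the pipeline is written in the framework's own API and the streams component is embedded at the step level.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class StepHelpers {

    // Hypothetical, simplified stand-in for a streams processor interface.
    interface Processor<I, O> {
        void prepare(Object configuration);
        List<O> process(I entry);
        void cleanUp();
    }

    // Hypothetical example component: splits a line into lower-cased words.
    static class WordSplitter implements Processor<String, String> {
        @Override public void prepare(Object configuration) { }
        @Override public List<String> process(String entry) {
            return Arrays.asList(entry.toLowerCase().split("\\s+"));
        }
        @Override public void cleanUp() { }
    }

    /** Static helper: prepares the processor and exposes it as a flat-map function. */
    static <I, O> Function<I, Stream<O>> asFlatMap(Processor<I, O> processor,
                                                   Object configuration) {
        processor.prepare(configuration);
        return entry -> processor.process(entry).stream();
    }

    public static void main(String[] args) {
        // The pipeline itself uses the host API; only one step is a streams component.
        List<String> words = Stream.of("Hello Streams", "hello Flink")
            .flatMap(asFlatMap(new WordSplitter(), null))
            .collect(Collectors.toList());
        System.out.println(words);
    }
}
```

Reads and writes stay with the framework's own connectors in this style, so the helper module never has to interpret persisters on the framework's behalf.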

> Any other thoughts on the topic? 
> Steve 
> - What should the focus be? If you look at the code, the project really 
> provides 3 things: (1) a stream processing engine and integration with data 
> persistence mechanisms, (2) a reference implementation of ActivityStreams, 
> AS schemas, and tools for interlinking activity objects and events, and (3) 
> a uniform API for integrating with social network APIs. I don't think that 
> first thing is needed anymore. Just looking at Apache projects, NiFi, Apex 
> + Apex Malhar, and to some extent Flume are further along here. StreamSets 
> covers some of this too, and arguably Logstash also gets used for this sort 
> of work. I.e., I think the project would be much stronger if it focused on 
> (2) and (3) and marrying those up to other Apache projects that fit (1). 
> Minimally, it needs to be disentangled a bit. 
