streams-dev mailing list archives

From sblackmon <>
Subject [DISCUSS] What to do with streams-runtime-local and other streams-runtimes modules
Date Tue, 27 Sep 2016 22:05:05 GMT

Joey brought this up over the weekend and I think a discussion is overdue on the topic.  

Streams components were meant to be compatible with other runtime frameworks all along, and
for the most part are implemented in a manner compatible with distributed execution, where
coordination, message passing, and lifecycle are handled outside of the streams libraries.  By
community standards, any component or component configuration object that isn't cleanly
serializable for relocation in a distributed framework is a bug.
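
To make the serializability expectation concrete, here is a minimal sketch of a round-trip
check of the kind a unit test could run against a component or its configuration. It assumes
plain Java serialization, which is only one of the mechanisms a distributed framework might
use. ExampleConfig and roundTrip are hypothetical names made up for illustration, not classes
from the streams codebase.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationCheck {

    // Hypothetical configuration object standing in for a real component configuration.
    static class ExampleConfig implements Serializable {
        private static final long serialVersionUID = 1L;
        final String endpoint;
        final int batchSize;

        ExampleConfig(String endpoint, int batchSize) {
            this.endpoint = endpoint;
            this.batchSize = batchSize;
        }
    }

    // Writes the object to a byte array and reads it back, roughly what happens
    // when a framework relocates a component to another JVM.
    // Throws NotSerializableException (an IOException) if any reachable field is not serializable.
    static <T extends Serializable> T roundTrip(T original)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            T copy = (T) in.readObject();
            return copy;
        }
    }

    public static void main(String[] args) throws Exception {
        ExampleConfig copy = roundTrip(new ExampleConfig("http://localhost:9200", 100));
        System.out.println(copy.endpoint + " " + copy.batchSize);
    }
}
```

A check like this only shows that the object survives Java serialization; a framework that
uses a different serializer would need its own equivalent.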

When the streams project got started in 2012, Storm was the only TLP real-time data processing
framework at Apache, but now there are plenty of good choices, all of which are faster and
better tested than our streams-runtime-local module.

So, what should be the role of streams-runtime-local?  Should we keep it at all?  The tests
take forever to run and my organization has stopped using it entirely.  The best argument
for keeping it is that it is useful when integration testing small pipelines, but perhaps
we could just agree to use something else for that purpose?

Do we want to keep the other runtime modules around and continue adding more?  I’ve found
that when embedding streams components in other frameworks (Spark and Flink most recently)
I end up creating a handful of classes to help bind streams interfaces and instances within
the UDFs / functions / transforms / whatever that framework's atomic unit of computation is,
and reusing them in all my pipelines.
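
As a rough illustration of the kind of binding class described above, the sketch below wraps
a component in a plain java.util.function.Function so it can be dropped into a flatMap-style
step. Everything here is an assumption for illustration: Processor is a simplified stand-in
rather than the real streams interface, ProcessorFunction and WordSplitter are hypothetical
names, and java.util.stream is used in place of a real Spark or Flink pipeline.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

public class StepAdapterSketch {

    // Simplified, hypothetical stand-in for a streams processor; not the real API.
    interface Processor<I, O> extends Serializable {
        void prepare();
        List<O> process(I input);
    }

    // Binds a processor to the host framework's unit of computation (here a Function).
    static class ProcessorFunction<I, O> implements Function<I, List<O>>, Serializable {
        private static final long serialVersionUID = 1L;
        private final Processor<I, O> processor;
        // transient: false again after deserialization, so prepare() runs once per worker copy
        private transient boolean prepared;

        ProcessorFunction(Processor<I, O> processor) {
            this.processor = processor;
        }

        @Override
        public List<O> apply(I input) {
            if (!prepared) {
                processor.prepare();
                prepared = true;
            }
            return processor.process(input);
        }
    }

    // Hypothetical component: splits a line into lower-cased words.
    static class WordSplitter implements Processor<String, String> {
        private static final long serialVersionUID = 1L;

        @Override
        public void prepare() {
            // nothing to set up in this toy example
        }

        @Override
        public List<String> process(String input) {
            return Arrays.asList(input.toLowerCase(Locale.ROOT).split("\\s+"));
        }
    }

    // The pipeline itself is written against the host API; the component is one step in it.
    static List<String> run(List<String> lines) {
        ProcessorFunction<String, String> step = new ProcessorFunction<>(new WordSplitter());
        return lines.stream()
                .flatMap(line -> step.apply(line).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Hello World", "Apache Streams")));
    }
}
```

The same wrapper could in principle be reused across pipelines, with only the outer class
changed to whatever function type the target framework expects.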

How about the StreamBuilder interface?  Does anyone still believe we should support (and
still want to work on) classes implementing StreamBuilder to build and run a pipeline
comprised solely of streams components on other frameworks?  Personally I prefer to write
code using the framework APIs at the pipeline level, and embed individual streams components
at the step level.

Any other thoughts on the topic?


- What should the focus be? If you look at the code, the project really provides 3 things:
(1) a stream processing engine and integration with data persistence mechanisms, (2) a reference
implementation of ActivityStreams, AS schemas, and tools for interlinking activity objects
and events, and (3) a uniform API for integrating with social network APIs. I don't think
the first of these is needed anymore. Just looking at Apache projects, NiFi, Apex + Apex Malhar,
and to some extent Flume are further along here. StreamSets covers some of this too, and
arguably Logstash also gets used for this sort of work. I.e., I think the project would be
much stronger if it focused on (2) and (3) and on marrying those up to other Apache projects
that fit (1). At a minimum, it needs to be disentangled a bit.