flink-user mailing list archives

From Gyula Fóra <gyula.f...@gmail.com>
Subject Re: Apache Flink and serious streaming stateful processing
Date Tue, 30 Jun 2015 12:23:38 GMT
Hi Krzysztof,

Thank you for your questions, we are happy to help you get started.

Regarding your questions:

1. There is backpressure for the streams, so if the downstream operators
cannot keep up, the sources will slow down.
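To illustrate the idea (this is a hypothetical sketch of the mechanism, not Flink's internal implementation): backpressure falls out naturally when operators exchange data through bounded buffers, because a producer blocks as soon as the buffer fills up, so a slow consumer throttles the source to its own rate.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: a bounded channel between a fast source and a slow operator.
// When the buffer is full, put() blocks -- the source is backpressured.
public class BackpressureSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(4); // small buffer

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    channel.take();
                    Thread.sleep(5); // deliberately slow downstream operator
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            channel.put(i); // blocks once the 4-slot buffer is full
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        consumer.join();

        // The producer was throttled to roughly the consumer's pace,
        // instead of racing ahead and overflowing memory.
        System.out.println("producer throttled for ~" + elapsedMs + " ms");
        assert elapsedMs > 100 : "producer should have been slowed by backpressure";
    }
}
```

The same principle applies across the network: a receiver that stops draining its input buffers eventually stalls the sender.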

2. We have support for stateful processing in Flink in many of the ways you
described in your question. Unfortunately the docs are down at the moment,
but you should check out the 'Stateful processing' section in the 0.10 docs
once they are back online. In practice we provide an OperatorState interface
which lets you keep state partitioned by some key and access it from
runtime operators. States declared through these interfaces are
checkpointed and restored on failure. Currently all state is
stored in-memory, but we are planning to extend this to allow writing state
updates to external systems.
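To make the checkpoint/restore idea concrete, here is a minimal, self-contained sketch of the concept; the class and method names are invented for illustration and are not Flink's OperatorState API. Per-key state is updated by the operator, periodically snapshotted, and rolled back to the last snapshot on recovery:

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch (not Flink's API): state partitioned by key,
// with snapshot() playing the role of a checkpoint and restore()
// the role of recovery after a failure.
public class KeyedStateSketch {
    private Map<String, Long> countsByKey = new HashMap<>();

    public void update(String key) {                  // called per record
        countsByKey.merge(key, 1L, Long::sum);
    }

    public Map<String, Long> snapshot() {             // "checkpoint"
        return new HashMap<>(countsByKey);
    }

    public void restore(Map<String, Long> checkpoint) { // recovery
        countsByKey = new HashMap<>(checkpoint);
    }

    public long get(String key) {
        return countsByKey.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        KeyedStateSketch op = new KeyedStateSketch();
        op.update("user-1");
        op.update("user-1");
        Map<String, Long> cp = op.snapshot(); // periodic checkpoint taken here
        op.update("user-1");                  // this update is lost on failure
        op.restore(cp);                       // simulate failure + recovery
        System.out.println(op.get("user-1")); // prints 2, the checkpointed value
    }
}
```

In the real system the snapshot is of course coordinated across operators and persisted, rather than kept in local memory as here.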

I will send you some pointers once the docs are up again.


Krzysztof Zarzycki <k.zarzycki@gmail.com> wrote (on Tue, 30 Jun 2015,
14:07):

> Greetings!
> I'm extremely interested in Apache Flink; I think you're doing a really
> great job! But please allow me to share the two things I would require
> from Apache Flink to consider it groundbreaking (and that I need from a
> streaming framework):
> 1. Stream backpressure. When the stream processing part does not keep up,
> please pause receiving new data. This is a serious problem in other
> frameworks, like Spark Streaming. Please see the Spark ticket about it:
> https://issues.apache.org/jira/browse/SPARK-7398
> 2. Support for (serious) stateful processing. What I mean by that is being
> able to keep the application's state in key-value stores, out-of-core, in
> embedded mode. I want to be able to keep, let's say, the history of events
> from the last two months, grouped & accessible by user_id, without using an
> external database (e.g. Cassandra) for it. Communicating with an external
> database would kill my performance, especially when *reprocessing*
> historical data. And I definitely don't want to fall back to batch
> processing (as in the Lambda Architecture).
> These two are (IMHO) the most important gaps in Spark Streaming and the
> reasons I'm not using it. Both are supported by Samza, whose code and API
> are not excellent, but which at least allows serious stream processing
> that does not require repeating the processing pipeline in batch (Hadoop).
> I'm looking forward to seeing features like these in Flink. Or are they
> already there and I'm just missing something?
> Thanks!
> Krzysztof Zarzycki
