flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephan Ewen <se...@apache.org>
Subject Re: streaming state
Date Mon, 14 Dec 2015 10:20:25 GMT
Hi Alex!

Right now, Flink would not reuse Kafka's partitioning for joins, but
shuffle/partition data by itself. Flink is very fast at shuffling and adds
very little latency on shuffles, so that is usually not an issue. The
reason that design is that we view streaming program as something dynamic:
Kafka partitions may be added or removed during a program's life time, and
the parallelism (and with that the partitioning scheme) can change as well.
With Flink handling the partitioning internally, these cases are all
covered.

Concerning the join: The built-in join is definitely able to handle
millions or records in a window, and scales well. What it does is windowing
the streams together and joining within the windows. If you want responses
within a second, you should make the window small enough that it evaluated
every 500ms or so.

If you want super low latency joins, you can look into using custom state
to do that. With that, you could build your custom symmetric hash join for
example. That has virtually zero latency and you can control how long each
side keeps the data.

Concerning key lookups vs shuffling: The shuffle variant is usually much
faster, because it uses network better. The shuffle is fully pipelined,
many records are in shuffle at the same time, it is optimized for
throughput and can still keep the latency quite low (
http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
).
In contrast, key lookups that avoid shuffling usually take a bit of time
(millisecond or so) and limit any throughput a lot because they involve
many smaller messages and add even more latency (roundtrip between nodes,
rather than one way).


Hope that this answers your question, let me know if you have more
questions!

Greetings,
Stephan


On Fri, Dec 11, 2015 at 4:00 PM, Alex Rovner <alex.rovner@magnetic.com>
wrote:

> Hello all,
>
> I was wondering if someone would be kind enough to enlighten me on a few
> topics. We are trying to join two streams of data on a key. We were
> thinking of partitioning topics in Kafka by the key, however I also saw
> that Flink is able to partition on its own and I was wondering whether
> Flink can take advantage of Kafka's partitioning and/or which partitioning
> scheme should I go for?
>
> As far as joins, our two datasets are very large (millions of records in
> each window) and we need to perform the joins very quickly (less than 1
> sec). Would the built in join mechanism be sufficient for this? From what I
> understand it would need to shuffle data and thus be slow on large
> datasets? I was wondering if there is a way to join via state key value
> lookups to avoid the shuffling?
>
> I read the docs and the blogs so far, thus have some limited understanding
> of how Flink works, no practical experience though.
>
> Thanks
>
>
> *Alex Rovner*
> *Director, Data Engineering *
> *o:* 646.759.0052
>
> * <http://www.magnetic.com/>*
>

Mime
View raw message