flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yukun Guo <gyk....@gmail.com>
Subject Hourly top-k statistics of DataStream
Date Tue, 07 Jun 2016 02:26:25 GMT

I'm working on a project which uses Flink to compute hourly log statistics
like top-K. The logs are fetched from Kafka by a FlinkKafkaProducer and
into a DataStream.

The problem is, I find the computation quite challenging to express with
Flink's DataStream API:

1. If I use something like `logs.timeWindow(Time.hours(1))`, suppose that
data volume is really high, e.g., billions of logs might be generated in one
hour, will the window grow too large and can't be handled efficiently?

2. We have to create a `KeyedStream` before applying `timeWindow`. However,
the distribution of some keys are skewed hence using them may compromise
the performance due to unbalanced partition loads. (What I want is just
rebalance the stream across all partitions.)

3. The top-K algorithm can be straightforwardly implemented with `DataSet`'s
`mapPartition` and `reduceGroup` API as in
[FLINK-2549](https://github.com/apache/flink/pull/1161/), but not so easy if
taking the DataStream approach, even with the stateful operators. I still
cannot figure out how to reunion streams once they are partitioned.

4. Is it possible to convert a DataStream into a DataSet? If yes, how can I
make Flink analyze the data incrementally rather than aggregating the logs
one hour before starting to process?

View raw message