flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bowen Li <bowen...@offerupnow.com>
Subject Re: Even out the number of generated windows
Date Sat, 26 Aug 2017 02:11:52 GMT
Hi Robert,
We use kinesis sink (FlinkKinesisProducer). The main pain is the Kinesis
Producer Library (KPL) that FlinkKinesisProducer uses.

KPL is basically a java wrapper with a c++ core. It's slow, unstable, easy
to crash, memory-and-CPU-consuming (it sends traffic via HTTP), and can't
handle high workload like a few million records at a short period of time.
Well, in order to write to Kinesis, there's no other options except KPL
(AWS Kinesis SDK is even slower), so I'm not blaming Flink chose KPL.

Are there any recommended ways to "artificially throttle down the stream
before the sink"? How to add the throttling into Flink's fluent API?

Thanks,
Bowen


On Fri, Aug 25, 2017 at 2:31 PM, Robert Metzger <rmetzger@apache.org> wrote:

> Hi Bowen,
>
> (very nice graphics :) )
>
> I don't think you can do anything about the windows itself (unless you are
> able to build the windows yourself using the ProcessFunction, playing some
> tricks because you know your data), so I should focus on reducing the pain
> in "burning down your sink".
> Are there any issues with the Sink by the spikes? (What's the downstream
> system?)
> Does it make sense for you to artificially throttle down the stream before
> the sink, so that the records per second get limited to a certain rate.
> Since you are using Event time, the window results will always be correct &
> consistent. From a business perspective, this will of course introduce
> additional latency (= results come in later).
>
>
> On Fri, Aug 25, 2017 at 6:23 AM, Bowen Li <bowen.li@offerupnow.com> wrote:
>
>> Hi guys,
>>
>> I do have a question for how Flink generates windows.
>>
>> We are using a 1-day sized sliding window with 1-hour slide to count some
>> features of items based on event time. We have about 20million items. We
>> observed that Flink only emit results on a fixed time in an hour (e.g. 1am,
>> 2am, 3am,  or 1:15am, 2:15am, 3:15am with a 15min offset). That's means
>> 20million windows/records are generated at the same time every hour, which
>> burns down our sink. But nothing is generated in the rest of that hour. The
>> pattern is like this:
>>
>> # generated windows
>> |
>> |    /\                  /\
>> |   /  \                /  \
>> |_/__\_______/__\_
>>                                  time
>>
>> Is there any way to even out the number of generated windows/records in
>> an hour? Can we have evenly distributed generated load like this?
>>
>> # generated windows
>> |
>> |
>> | ------------------------
>> |_______________
>>                                  time
>>
>> Thanks,
>> Bowen
>>
>>
>

Mime
View raw message