flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Metzger <rmetz...@apache.org>
Subject Re: Flink the right tool for the job ? Huge Data window lateness
Date Fri, 24 Feb 2017 16:47:54 GMT
Hi,
sounds like a cool project.

What's the size of one data point?
If one datapoint is 2 kb, you'll have 100 800 000 * 2048 bytes = 206
gigabytes of state. That's something one or two machines (depending on the
disk throughput) should be able to handle.

If possible, I would recommend you to do an experiment using a prototype to
see how many machines you need for your workload.

On Fri, Feb 24, 2017 at 5:41 PM, Tzu-Li (Gordon) Tai <tzulitai@apache.org>
wrote:

> Hi Patrick,
>
> Thanks a lot for feedback on your use case! At a first glance, I would say
> that Flink can definitely solve the issues you are evaluating.
>
> I’ll try to explain them, and point you to some docs / articles that can
> further explain in detail:
>
> *- Lateness*
>
> The 7-day lateness shouldn’t be a problem. We definitely recommend
> using RocksDB as the state backend for such a use case, as you
> mentioned correctly, the state would be kept for a long time.
> The heavy burst when your locally buffered data on machines are
> sent to Kafka once they come back online shouldn’t be a problem either;
> since Flink is a pure data streaming engine, it handles backpressure
> naturally without any additional mechanisms (I would recommend
> taking a look at http://data-artisans.com/how-flink-handles-backpressure/
> ).
>
> *- Out of Order*
>
> That’s exactly what event time processing is for :-) As long as the event
> comes in before the allowed lateness for windows, the event will still fall
> into its corresponding event time window. So, even with the heavy burst of
> the your late machine data, they will still be aggregated in the correct
> windows.
> You can look into event time in Flink with more detail in the event time
> docs:
> https://ci.apache.org/projects/flink/flink-docs-
> release-1.3/dev/event_time.html
>
> *- Last write wins*
>
> Your operators that does the aggregations simply need to be able to
> reprocess
> results if it sees an event with the same id come in. Now, if results are
> sent out
> of Flink and stored in an external db, if you can design the db writes to
> be idempotent,
> then it’ll effectively be a “last write wins”. It depends mostly on your
> pipeline and
> use case.
>
> *- Computations per minute*
> I think you can simply do this by having two separate window operators.
> One that works on your longer window, and another on a per-minute basis.
>
> Hope this helps!
>
> - Gordon
>
> On February 24, 2017 at 10:49:14 PM, Patrick Brunmayr (jay@kpibench.com)
> wrote:
>
> Hello
>
> I've done my first steps with Flink and i am very impressed of its
> capabilities. Thank you for that :) I want to use it for a project we are
> currently working on. After reading some documentation
> i am not sure if it's the right tool for the job. We have an IoT
> application in which we are monitoring machines in production plants. The
> machines have sensors attached and they are sending
> their data to a broker ( Kafka, Azure Iot Hub ) currently on a per minute
> basis.
>
> Following requirements must be fulfilled
>
>
>    - Lateness
>
>    We have to allow lateness for 7 days because machines can have down
>    time due network issues, maintenance or something else. If thats the case
>    buffering of data happens localy on the machine and once they
>    are online again all data will be sent to the broker. This can result
>    in some relly heavy burst.
>
>
>    - Out of order
>
>    Events come out of order due this lateness issues
>
>
>    - Last write wins
>
>    Machines are not stateful and can not guarantee exactly once sending
>    of their data. It can happen that sometimes events are sent twice. In that
>    case the last event wins and should override the previous one.
>    Events are unique due a sensor_id and a timestamp
>
>    - Computations per minute
>
>    We can not wait until the windows ends and have to do computations on
>    a per minute basis. For example aggregating data per sensor and writing it
>    to a db
>
>
> My biggest concern in that case is the huge lateness. Keeping data for 7
> days would result in 10080 data points for just one sensor! Multiplying
> that by 10.000 sensors would result in 100800000 datapoints which Flink
> would have to handle in its state. The number of sensors are constantly
> growing so will the number of data points
>
> So my questions are
>
>
>    - Is Flink the right tool for the Job ?
>
>    - Is that lateness an issue ?
>
>    - How can i implement the Last write wins ?
>
>    - How to tune flink to handle that growing load of sensors and data
>    points ?
>
>    - Hardware requirements, storage and memory size ?
>
>
>
> I don't want to maintain two code base for batch and streaming because the
> operations are all equal. The only difference is the time range! Thats the
> reason i wanted to do all this with Flink Streaming.
>
> Hope you can guide me in the right direction
>
> Thx
>
>
>
>
>
>
>
>

Mime
View raw message