From: Fabian Hueske
Date: Wed, 18 Jan 2017 17:22:14 +0100
Subject: Re: Window limitations on groupBy
To: user@flink.apache.org

Hi Raman,

I would approach this issue as follows.

You key the input stream on the sourceId and apply a stateful FlatMapFunction.
The FlatMapFunction has key-partitioned state and stores, for each key (sourceId), the latest event.
When a new event arrives, you can compute the time spent in the previous state by comparing the timestamp of the stored event with that of the newly received event.
Then you put the new event into the state.

This solution works well if you have a finite number of sources, or if a terminal event signals that no more events will arrive for a key.
Otherwise, the state (one stored event per key) will grow indefinitely and eventually become a problem.
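
A minimal sketch of that approach (the StateChangeEvent POJO, its field names, and the operator name below are illustrative assumptions based on your example event, not a fixed API):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Assumed event type; public fields make it a Flink POJO.
class StateChangeEvent {
    public int sourceId;
    public String state;
    public long timestamp;
}

// Emits (state, durationMillis) whenever a source transitions to a new state.
public class StateDurationFunction
        extends RichFlatMapFunction<StateChangeEvent, Tuple2<String, Long>> {

    private transient ValueState<StateChangeEvent> lastEvent;

    @Override
    public void open(Configuration parameters) {
        lastEvent = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastEvent", StateChangeEvent.class));
    }

    @Override
    public void flatMap(StateChangeEvent current, Collector<Tuple2<String, Long>> out)
            throws Exception {
        StateChangeEvent previous = lastEvent.value();
        if (previous != null) {
            // Time spent in the previous state is the gap between the two timestamps.
            out.collect(Tuple2.of(previous.state, current.timestamp - previous.timestamp));
        }
        lastEvent.update(current);
    }
}

You would wire it up with something like events.keyBy("sourceId").flatMap(new StateDurationFunction()), and the (state, duration) pairs can then be aggregated into your per-state averages.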

If the number of sources increases over time, you need to evict state at some point. A ProcessFunction can help here, because you can register a timer that you can use to evict old state.
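
A sketch of that eviction variant (same assumed event type as above; the 30-day retention period is an arbitrary placeholder, and timers are only available on keyed streams):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Same duration logic as the FlatMapFunction, plus eviction of state for
// sources that have been silent longer than an (assumed) retention period.
public class EvictingStateDurationFunction
        extends ProcessFunction<StateChangeEvent, Tuple2<String, Long>> {

    // Assumed retention period of 30 days; tune to your domain.
    private static final long RETENTION_MS = 30L * 24 * 60 * 60 * 1000;

    private transient ValueState<StateChangeEvent> lastEvent;
    private transient ValueState<Long> lastSeen; // processing time of last update

    @Override
    public void open(Configuration parameters) {
        lastEvent = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastEvent", StateChangeEvent.class));
        lastSeen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastSeen", Long.class));
    }

    @Override
    public void processElement(StateChangeEvent current, Context ctx,
                               Collector<Tuple2<String, Long>> out) throws Exception {
        StateChangeEvent previous = lastEvent.value();
        if (previous != null) {
            out.collect(Tuple2.of(previous.state, current.timestamp - previous.timestamp));
        }
        lastEvent.update(current);

        long now = ctx.timerService().currentProcessingTime();
        lastSeen.update(now);
        // One timer per update; stale timers fire, find fresher state, and do nothing.
        ctx.timerService().registerProcessingTimeTimer(now + RETENTION_MS);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<String, Long>> out) throws Exception {
        Long seen = lastSeen.value();
        // Evict only if no event arrived since this timer was registered.
        if (seen != null && timestamp >= seen + RETENTION_MS) {
            lastEvent.clear();
            lastSeen.clear();
        }
    }
}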

Hope this helps,
Fabian

2017-01-18 15:39 GMT+01:00 Raman Gupta <rocketraman@gmail.com>:
> I am investigating Flink. I am considering a relatively simple use
> case -- I want to ingest streams of events that are essentially
> timestamped state changes. These events may look something like:
>
> {
>   sourceId: 111,
>   state: OPEN,
>   timestamp: <date/time>
> }
>
> I want to apply various processing to these state change events, the
> output of which can be used for analytics. For example:
>
> 1. average time spent in state, by state
> 2. sources with longest (or shortest) time spent in OPEN state
>
> The time spent in each state may be days or even weeks.
>
> All the examples I have seen of similar logic involve windows on the
> order of 15 minutes. Since time spent in each state may far exceed
> these window sizes, I'm wondering what the best approach will be.
>
> One thought from reading the docs is to use `every` to operate on the
> entire stream. But it seems like this will take longer and longer to
> run as the event stream grows, so this is not an ideal solution. Or
> does Flink apply some clever optimizations to avoid the potential
> performance issue?
>
> Another thought was to split the event stream into multiple streams by
> source, each of which will have a small (and limited) amount of data.
> This will make processing each stream simpler, but since there can be
> thousands of sources, it will result in a lot of streams to handle and
> persist (probably in Kafka). This does not seem ideal either.
>
> It seems like this should be simple, but I'm struggling with
> understanding how to solve it elegantly.
>
> Regards,
> Raman

