From: Simone Robutti
Date: Fri, 30 Sep 2016 11:00:07 +0200
Subject: Re: Counting latest state of stateful entities in streaming
To: user@flink.apache.org

I'm working with your suggestions, thank you very much. What I'm missing
here is what YourWindowFunction should do. I have no notion of event time
there, so I can't assign a timestamp. Also, this solution seems to work by
processing time, while I care about event time. I couldn't make it run yet,
but from what I understood, this is slightly different from what I need.

2016-09-30 10:04 GMT+02:00 Fabian Hueske <fhueske@gmail.com>:

> Hi Simone,
>
> I think I have a solution for your problem:
>
> val s: DataStream[(Long, Int, Long)] = ??? // (id, state, time)
>
> val stateChanges: DataStream[(Int, Int)] = s // (state, cntUpdate)
>   .keyBy(_._1) // key by id
>   .flatMap(new StateUpdater) // StateUpdater is a stateful
> FlatMapFunction. It has a keyed state that stores the last state of each
> id. For each input record it returns two records: (oldState, -1),
> (newState, +1)
>
> stateChanges ensures that the counts of previous states are subtracted.
>
> val changesPerWindow: DataStream[(Int, Int, Long)] = stateChanges //
> (state, cntUpdate, time)
>   .keyBy(_._1) // key by state
>   .window(...) // your window; it should be non-overlapping, so go for
> instance for a tumbling window
>   .apply(new SumReducer(), new YourWindowFunction()) // SumReducer sums
> the cntUpdates and YourWindowFunction assigns the timestamp of your window
>
> This step aggregates all state changes for each state in a window.
>
> val stateCnts: DataStream[(Int, Int, Long)] = changesPerWindow // (state,
> count, time)
>   .keyBy(_._1) // key by state again
>   .map(new CountUpdater) // CountUpdater is a stateful MapFunction. It
> has a keyed state that stores the current count. For each incoming record,
> the count is adjusted and a record (state, newCount, time) is emitted.
>
> Now you have the new counts for your states in multiple records. If
> possible, you can update your Elasticsearch index using these. Otherwise,
> you have to collect them into one record using another window.
>
> Also note that the state size of this program depends on the number of
> unique ids. That might cause problems if the id space grows very fast.
>
> Please let me know if you have questions or if that works ;-)
>
> Cheers, Fabian
>
>
> 2016-09-30 0:32 GMT+02:00 Simone Robutti <simone.robutti@radicalbit.io>:
>
>> Hello,
>>
>> in the last few days I tried to create my first real-time analytics job
>> in Flink. The approach is kappa-architecture-like, so I have my raw data
>> on Kafka, where we receive a message for every change of state of any
>> entity.
>>
>> So the messages are of the form
>>
>> (id, newStatus, timestamp)
>>
>> We want to compute, for every time window, the count of items in a given
>> status. So the output should be of the form
>>
>> (outputTimestamp, state1:count1, state2:count2, ...)
>>
>> or equivalent.
>> These rows should contain, at any given time, the count of the items in
>> a given status, where the status associated with an id is the one from
>> the most recent message observed for that id. The status of an id should
>> be counted in any case, even if the event is much older than those being
>> processed. So the sum of all the counts should be equal to the number of
>> distinct ids observed in the system. A following step could be forgetting
>> about the items in a final state after a while, but this is not a strict
>> requirement right now.
>>
>> This will be written to Elasticsearch and then queried.
>>
>> I tried many different paths and none of them completely satisfied the
>> requirement. Using a sliding window I could easily achieve the expected
>> behaviour, except that when the start of the sliding window passed the
>> timestamp of an event, that event was lost for the count, as you may
>> expect. Other approaches failed to be consistent when working with a
>> backlog, because I used some tricks with keys and timestamps that broke
>> when the data was processed all at once.
>>
>> So I would like to know, even at a high level, how I should approach this
>> problem. It looks like a relatively common use case, but the fact that
>> the relevant information for a given id must be retained indefinitely to
>> count the entities correctly creates a lot of problems.
>>
>> Thank you in advance,
>>
>> Simone
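Fabian's outline above names three placeholder functions without showing their bodies. A minimal sketch of what StateUpdater could look like, assuming input tuples of the form (id, state, timestamp) and Flink's keyed ValueState API (whose constructor signatures vary slightly across Flink versions):

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Emits (oldState, -1) and (newState, +1) for every status update of an id,
// so that downstream counts of the previous state are retracted.
class StateUpdater extends RichFlatMapFunction[(Long, Int, Long), (Int, Int)] {

  // last known state per id (the stream is keyed by id)
  private var lastState: ValueState[Integer] = _

  override def open(parameters: Configuration): Unit = {
    lastState = getRuntimeContext.getState(
      new ValueStateDescriptor[Integer]("last-state", classOf[Integer]))
  }

  override def flatMap(in: (Long, Int, Long), out: Collector[(Int, Int)]): Unit = {
    val old = lastState.value() // null if this id has not been seen before
    if (old != null) {
      out.collect((old.intValue, -1)) // retract the count of the previous state
    }
    lastState.update(in._2)
    out.collect((in._2, +1)) // add a count for the new state
  }
}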
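On Simone's question about what YourWindowFunction should do given that there is "no notion of event time there": the window function is handed the TimeWindow object itself, so window.getEnd can serve as the record's timestamp, and when the job runs with event-time windows this is an event-time value, not a processing-time one. A sketch of SumReducer and YourWindowFunction under the same assumptions as above:

import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Pre-aggregates within the window: sums the +1/-1 updates per state.
class SumReducer extends ReduceFunction[(Int, Int)] {
  override def reduce(a: (Int, Int), b: (Int, Int)): (Int, Int) =
    (a._1, a._2 + b._2)
}

// Receives the single pre-aggregated record per state and window and
// stamps it with the window's end timestamp.
class YourWindowFunction
    extends WindowFunction[(Int, Int), (Int, Int, Long), Int, TimeWindow] {

  override def apply(key: Int, window: TimeWindow,
                     input: Iterable[(Int, Int)],
                     out: Collector[(Int, Int, Long)]): Unit = {
    val (state, cntUpdate) = input.head // exactly one record, thanks to SumReducer
    out.collect((state, cntUpdate, window.getEnd))
  }
}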
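CountUpdater is the same pattern as StateUpdater, a keyed running aggregate; a minimal sketch:

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration

// Maintains a running count per state and applies each window's delta to it.
class CountUpdater extends RichMapFunction[(Int, Int, Long), (Int, Int, Long)] {

  // current count per state (the stream is keyed by state)
  private var count: ValueState[Integer] = _

  override def open(parameters: Configuration): Unit = {
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[Integer]("count", classOf[Integer]))
  }

  override def map(in: (Int, Int, Long)): (Int, Int, Long) = {
    val current = if (count.value() == null) 0 else count.value().intValue
    val updated = current + in._2
    count.update(updated)
    (in._1, updated, in._3) // (state, newCount, time)
  }
}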
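On the processing-time versus event-time concern: even though the (state, cntUpdate) records carry no time field, Flink propagates each record's event timestamp through keyBy and flatMap, so the downstream windows still fire on event time once timestamps and watermarks are assigned at the source. A sketch of the whole pipeline wired for event time, assuming the third tuple field is an epoch-millisecond timestamp that is roughly ascending per source partition (otherwise a watermark assigner with bounded lateness is needed) and an arbitrary one-minute tumbling window:

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val s: DataStream[(Long, Int, Long)] = ??? // (id, state, timestamp), e.g. from Kafka

val stateCnts: DataStream[(Int, Int, Long)] = s
  .assignAscendingTimestamps(_._3) // event time from the message timestamp
  .keyBy(_._1)                     // key by id
  .flatMap(new StateUpdater)       // (state, cntUpdate)
  .keyBy(_._1)                     // key by state
  .timeWindow(Time.minutes(1))     // tumbling event-time window
  .apply(new SumReducer(), new YourWindowFunction()) // (state, windowDelta, windowEnd)
  .keyBy(_._1)                     // key by state again
  .map(new CountUpdater)           // (state, runningCount, windowEnd)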