From: Ajay Krishna <ajaykrishna@gmail.com>
Date: Thu, 28 Sep 2017 08:17:56 -0700
Subject: Re: Issue with CEP library
To: Kostas Kloudas <k.kloudas@data-artisans.com>
Cc: user@flink.apache.org

Hi Kostas,

Thank you for reaching out and for the suggestions. Here are the results:

1. Using an env parallelism of 1 performed similarly, with the additional problem that there was significant lag in the Kafka topic.
2. I removed the additional keyBy(0), but that did not change anything.
3. I also tried checking only for the start-only pattern, and it was exactly the same: I saw one of the homes going through but 3 others just getting dropped.
4. I also tried slowing down the rate into Kafka from 5000/second to about 1000/second, but I see similar results.

I was wondering if you had any other solutions to the problem. I am especially concerned about 1 and 3. Is this library under active development? Is there a JIRA open on this issue, and could we open one to track it?


I was reading on Stack Overflow and found a user who had a very similar issue in Aug '16. So I contacted him to discuss the issue and learned that the pattern of failure was exactly the same.

https://stackoverflow.com/questions/38870819/flink-cep-is-not-deterministic
Before I found the above post, I created a post for this issue:

https://stackoverflow.com/questions/46458873/flink-cep-not-recognizing-pattern

I would really appreciate your guidance on this.

Best regards,
Ajay




On Thu, Sep 28, 2017 at 1:38 AM, Kostas Kloudas <k.kloudas@data-artisans.com> wrote:
Hi Ajay,
I will look a bit more into the issue.

But in the meantime, could you run your job with a parallelism of 1, to see if the results are as expected?

Also, could you change the pattern, for example to check only for the start, to see if all keys pass through.

As for the code, you apply keyBy(0) to the cepMap stream twice, which is redundant and introduces latency.
You could remove that to also see the impact.
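
For example, a sketch based on the snippet quoted below, dropping only the second keyBy(0) since cepMapByHomeId is already keyed:

    // cepMapByHomeId is already keyed by the first tuple field (home_id):
    DataStream<Tuple8<Integer, Date, String, String, Float, Float, Float, Float>> cepMapByHomeId =
            cepMap.keyBy(0);

    // Pass the keyed stream directly, instead of CEP.pattern(cepMapByHomeId.keyBy(0), cep1):
    PatternStream<Tuple8<Integer, Date, String, String, Float, Float, Float, Float>> patternStream =
            CEP.pattern(cepMapByHomeId, cep1);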

Kostas

On Sep 28, 2017, at 2:57 AM, Ajay Krishna <ajaykrishna@gmail.com> wrote:
Hi,

I've been working with Flink for only the past 2 weeks on a project, and am trying out the CEP library on sensor data. I am using Flink version 1.3.2. Flink has a Kafka source; I am using KafkaSource9. I am running Flink on a 3-node AWS cluster with 8G of RAM, running Ubuntu 16.04. From the Flink dashboard, I see that I have 2 TaskManagers & 4 task slots.
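
(For context, a minimal sketch of what such a source can look like in Flink 1.3, assuming KafkaSource9 refers to the 0.9 Kafka consumer connector; the topic name and properties are assumptions, since the actual source setup is not shown in this thread:)

    // import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
    // import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "localhost:9092"); // assumption
    props.setProperty("group.id", "cep-demo");                // hypothetical group id
    DataStream<String> raw = env.addSource(
            new FlinkKafkaConsumer09<>("sensor-topic",        // hypothetical topic
                    new SimpleStringSchema(), props));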

What I observe is the following. The input to Kafka is a JSON string, and when parsed on the Flink side it looks like this:

(101,Sun Sep 24 23:18:53 UTC 2017,complex event,High,37.75142,-122.39458,12.0,20.0)

I use a Tuple8 to capture the parsed data. The first field is home_id. The time characteristic is set to EventTime, and I have an AscendingTimestampExtractor using the timestamp field. Parallelism for the execution environment is set to 4.
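
(A minimal sketch of that timestamp assignment, assuming the Date sits in field f1; the actual extractor is not shown in this thread:)

    // import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
    cepMap.assignTimestampsAndWatermarks(
            new AscendingTimestampExtractor<Tuple8<Integer, Date, String, String, Float, Float, Float, Float>>() {
                @Override
                public long extractAscendingTimestamp(
                        Tuple8<Integer, Date, String, String, Float, Float, Float, Float> element) {
                    // Event time from the record's Date field (assumed to be f1).
                    return element.f1.getTime();
                }
            });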
I have a rather simple event that I am trying to capture:

DataStream<Tuple8<Integer, Date, String, String, Float, Float, Float, Float>> cepMapByHomeId = cepMap.keyBy(0);

//cepMapByHomeId.print();

Pattern<Tuple8<Integer, Date, String, String, Float, Float, Float, Float>, ?> cep1 =
        Pattern.<Tuple8<Integer, Date, String, String, Float, Float, Float, Float>>begin("start")
                .where(new OverLowThreshold())
                .followedBy("end")
                .where(new OverHighThreshold());

PatternStream<Tuple8<Integer, Date, String, String, Float, Float, Float, Float>> patternStream =
        CEP.pattern(cepMapByHomeId.keyBy(0), cep1);

DataStream<Tuple7<Integer, Date, Date, String, String, Float, Float>> alerts =
        patternStream.select(new PackageCapturedEvents());

The pattern checks whether the 7th field in the Tuple8 goes over 12 and then over 16. The output of the pattern looks like this:

(201,Tue Sep 26 14:56:09 UTC 2017,Tue Sep 26 15:11:59 UTC 2017,complex event,Non-event,37.75837,-122.41467)
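
(For completeness, a hedged sketch of what the two conditions could look like, using the thresholds from the sentence above; the actual OverLowThreshold/OverHighThreshold classes are not shown in this thread:)

    // import org.apache.flink.cep.pattern.conditions.SimpleCondition;
    public static class OverLowThreshold
            extends SimpleCondition<Tuple8<Integer, Date, String, String, Float, Float, Float, Float>> {
        @Override
        public boolean filter(Tuple8<Integer, Date, String, String, Float, Float, Float, Float> event) {
            return event.f6 > 12.0f; // 7th field (f6) over the low threshold
        }
    }

    public static class OverHighThreshold
            extends SimpleCondition<Tuple8<Integer, Date, String, String, Float, Float, Float, Float>> {
        @Override
        public boolean filter(Tuple8<Integer, Date, String, String, Float, Float, Float, Float> event) {
            return event.f6 > 16.0f; // 7th field (f6) over the high threshold
        }
    }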

On the Kafka producer side, I am trying to send simulated data for around 100 homes, so the home_id goes from 0-100 and the input is keyed by home_id. I have about 10 partitions in Kafka. The producer just loops through a csv file with a delay of about 100 ms between 2 rows of the file. The data is exactly the same for all 100 of the csv files except for the home_id and the lat & long information. The timestamp is incremented in steps of 1 sec. I start multiple processes to simulate data from different homes.
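
(Likewise, a minimal sketch of such a producer loop; the file name, topic, key, and serializers are assumptions, since the actual producer is not shown in this thread:)

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SimulatedHomeProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // One csv file per home; each record keyed by home_id (names hypothetical).
                for (String row : Files.readAllLines(Paths.get("home_42.csv"))) {
                    producer.send(new ProducerRecord<>("sensor-topic", "42", row));
                    Thread.sleep(100); // ~100 ms between rows, as described above
                }
            }
        }
    }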

THE PROBLEM:

Flink completely misses capturing events for a large subset of the input data. I barely see the events for about 4-5 of the home_id values. I do a print before applying the pattern and after, and I see all home_ids before and only a tiny subset after. Since the data is exactly the same, I expect all home_ids to be captured and written to my sink, which is Cassandra in this case. I've looked through all available docs and examples but cannot seem to find a fix for the problem.

I would really appreciate some guidance on how to understand and fix this.


Thank you,

Ajay

