From user-return-29415-archive-asf-public=cust-asf.ponee.io@flink.apache.org Fri Aug 23 12:23:00 2019 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with SMTP id B66C4180637 for ; Fri, 23 Aug 2019 14:22:59 +0200 (CEST) Received: (qmail 53775 invoked by uid 500); 23 Aug 2019 12:22:58 -0000 Mailing-List: contact user-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list user@flink.apache.org Received: (qmail 53765 invoked by uid 99); 23 Aug 2019 12:22:58 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 23 Aug 2019 12:22:58 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id ACADD182A8B for ; Fri, 23 Aug 2019 12:22:57 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2 X-Spam-Level: ** X-Spam-Status: No, score=2 tagged_above=-999 required=6.31 tests=[HTML_MESSAGE=2, KAM_SHORT=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Received: from mx1-he-de.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id 98e2yqfQFPRx for ; Fri, 23 Aug 2019 12:22:53 +0000 (UTC) Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=5.57.41.100; helo=email-login.eu; envelope-from=theo.diefenthal@scoop-software.de; receiver= Received: from email-login.eu (email-login.eu [5.57.41.100]) by mx1-he-de.apache.org (ASF Mail Server at mx1-he-de.apache.org) with ESMTPS id 6822B7DD30 for ; Fri, 23 Aug 2019 12:22:53 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by email-login.eu (Postfix) with ESMTP id 947D11C16F1 for ; Fri, 23 Aug 2019 14:22:44 +0200 (CEST) Received: from email-login.eu ([127.0.0.1]) by localhost (email-login.eu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id ggQT362hYeRe for ; Fri, 23 Aug 2019 14:22:44 +0200 (CEST) Received: from localhost (localhost [127.0.0.1]) by email-login.eu (Postfix) with ESMTP id 787101C31D3 for ; Fri, 23 Aug 2019 14:22:44 +0200 (CEST) X-Virus-Scanned: amavisd-new at email-login.eu Received: from email-login.eu ([127.0.0.1]) by localhost (email-login.eu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id jAj3rcxAmfoi for ; Fri, 23 Aug 2019 14:22:44 +0200 (CEST) Received: from email-login.eu (zcs-mbx1.email-login.eu [5.57.41.100]) by email-login.eu (Postfix) with ESMTP id 4DD781C16F1 for ; Fri, 23 Aug 2019 14:22:44 +0200 (CEST) Date: Fri, 23 Aug 2019 14:22:44 +0200 (CEST) From: Theo Diefenthal To: user Message-ID: <815403771.8226010.1566562964213.JavaMail.zimbra@scoop-software.de> In-Reply-To: References: <273611979.3859255.1565621335885.JavaMail.zimbra@scoop-software.de> Subject: Re: I'm not able to make a stream-stream Time windows JOIN in Flink SQL MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="=_dee9df68-bfd3-4790-b210-b6f1d0ef49a0" X-Mailer: Zimbra 8.8.15_GA_3829 (ZimbraWebClient - FF68 (Win)/8.8.15_GA_3829) Thread-Topic: I'm not able to make a stream-stream Time windows JOIN in Flink SQL Thread-Index: UOwDPNKjdAuutPzu3dFwIQEWU/fKJA== --=_dee9df68-bfd3-4790-b210-b6f1d0ef49a0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Hi Fabian, Hi Zhenghua Thank you for your suggestions and telling me that I was on the right track. And good to know how to find out whether something yields to time-bounded or regular join. @Fabian: Regarding your suggested first option: Isn't that exactly what my first try was? With this TUMBLE_START... That sadly didn't work due to " Rowtime attributes must not be in the input rows of a regular join ". But I'll give option 2 a try by just adding another attribute. And some addition: Regarding my second try: I wrote that the reduced query didn't produce any data, but that was indeed my mistake. I fiddled around too much with my data so that I manipulated the original data in a way that the query couldn't output a result any more when testing all of those combinations. Now the second attempt works but isn't really what I wanted to query (as the "same day"-predicate is still missing). Best regards Theo Von: "Fabian Hueske" An: "Zhenghua Gao" CC: "Theo Diefenthal" , "user" Gesendet: Freitag, 16. August 2019 10:05:45 Betreff: Re: I'm not able to make a stream-stream Time windows JOIN in Flink SQL Hi Theo, The main problem is that the semantics of your join (Join all events that happened on the same day) are not well-supported by Flink yet. In terms of true streaming joins, Flink supports the time-windowed join (with the BETWEEN predicate) and the time-versioned table join (which does not apply here). The first does not really fit because it puts the windows "around the event", i.e., if you have an event at 12:35 and a window of 10 mins earlier and 15 mins later, it will join with events between 12:25 and 12:50. An other limitation of Flink is that you cannot modify event-time attributes (well you can, but they lose their event-time property and become regular TIMESTAMP attributes). This limitation exists, because we must ensure that the attributes are still aligned with watermarks after they were modified (or adjusting the watermarks accordingly). Since analyzing expressions that modify timestamps to figure out whether they preserve watermark alignment is very difficult, we opted to always remove event-time property when an event-time attribute is modified. I see two options for your use case: 1) use the join that you described before with the -24 and +24 hour window and apply more fine-grained predicates to filter out the join results that you don't need. 2) add an additional time attribute to your input that is a rounded down version of the timestamp (rounded to 24h), declare the rounded timestamp as your event-time attribute, and join with an equality predicate on the rounded timestamp. Best, Fabian Am Di., 13. Aug. 2019 um 13:41 Uhr schrieb Zhenghua Gao < [ mailto:docete@gmail.com | docete@gmail.com ] >: I wrote a demo example for time windowed join which you can pick up [1] [1] [ https://gist.github.com/docete/8e78ff8b5d0df69f60dda547780101f1 | https://gist.github.com/docete/8e78ff8b5d0df69f60dda547780101f1 ] Best Regards, Zhenghua Gao On Tue, Aug 13, 2019 at 4:13 PM Zhenghua Gao < [ mailto:docete@gmail.com | docete@gmail.com ] > wrote: BQ_BEGIN You can check the plan after optimize to verify it's a regular join or time-bounded join(Should have a WindowJoin). The most direct way is breakpoint at optimizing phase [1][2]. And you can use your TestData and create an ITCase for debugging [3] [1] [ https://github.com/apache/flink/blob/master/flink-table/flink-table-planner-blink/src/main/scala/org/apache/flink/table/planner/delegation/PlannerBase.scala#L148 | https://github.com/apache/flink/blob/master/flink-table/flink-table-planner-blink/src/main/scala/org/apache/flink/table/planner/delegation/PlannerBase.scala#L148 ] [2] [ https://github.com/apache/flink/blob/master/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/plan/StreamOptimizer.scala#L68 | https://github.com/apache/flink/blob/master/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/plan/StreamOptimizer.scala#L68 ] [3] [ https://github.com/apache/flink/blob/master/flink-table/flink-table-planner-blink/src/test/scala/org/apache/flink/table/planner/runtime/stream/sql/WindowJoinITCase.scala | https://github.com/apache/flink/blob/master/flink-table/flink-table-planner-blink/src/test/scala/org/apache/flink/table/planner/runtime/stream/sql/WindowJoinITCase.scala ] Best Regards, Zhenghua Gao On Mon, Aug 12, 2019 at 10:49 PM Theo Diefenthal < [ mailto:theo.diefenthal@scoop-software.de | theo.diefenthal@scoop-software.de ] > wrote: BQ_BEGIN Hi there, Currently, I'm trying to write a SQL query which shall executed a time windowed/bounded JOIN on two data streams. Suppose I have stream1 with attribute id, ts, user and stream2 with attribute id, ts, userName. I want to receive the natural JOIN of both streams with events of the same day. In Oracle (With a ts column as number instead of Timestamp, for historical reasons), I do the following: SELECT * FROM STREAM1 JOIN STREAM2 ON STREAM1. "user" = STREAM2. "userName" AND TRUNC ( TO_DATE ( '19700101' , 'YYYYMMDD' ) + ( 1 / 24 / 60 / 60 / 1000 ) * STREAM1. "ts" ) = TRUNC ( TO_DATE ( '19700101' , 'YYYYMMDD' ) + ( 1 / 24 / 60 / 60 / 1000 ) * STREAM2. "ts" ); which yields 294 rows with my test data (14 elements from stream1 match to 21 elements in stream2 on the one day of test data). Now I want to query the same in Flink. So I registered both streams as table and properly registered the even-time (by specifying ts.rowtime as table column). My goal is to produce a time-windowed JOIN so that, if both streams advance their watermark far enough, an element is written out into an append only stream. First try (to conform time-bounded-JOIN conditions): SELECT [ http://s1.id/ | s1.id ] , [ http://s2.id/ | s2.id ] FROM STREAM1 AS s1 JOIN STREAM2 AS s2 ON s1.`user` = s2.userName AND s1.ts BETWEEN s2.ts - INTERVAL '24' HOUR AND s2.ts + INTERVAL '24' HOUR AND s2.ts BETWEEN s1.ts - INTERVAL '24' HOUR AND s1.ts + INTERVAL '24' HOUR AND TUMBLE_START(s1.ts, INTERVAL '1' DAY ) = TUMBLE_START(s2.ts, INTERVAL '1' DAY ) -- Reduce to matchings on the same day. This yielded in the exception "Rowtime attributes must not be in the input rows of a regular join. As a workaround you can cast the time attributes of input tables to TIMESTAMP before.". So I'm still in the area of regular joins, not time-windowed JOINs, even though I made the explicit BETWEEN for both input streams! Then I found [1], which really is my query but without the last condition (reduce to matching on the same day). I tried this one as well, just to have a starting point, but the error is the same. I then reduced the Condition to just one time bound: SELECT [ http://s1.id/ | s1.id ] , [ http://s2.id/ | s2.id ] FROM STREAM1 AS s1 JOIN STREAM2 AS s2 ON s1.`user` = s2.userName AND s1.ts BETWEEN s2.ts - INTERVAL '24' HOUR AND s2.ts + INTERVAL '24' HOUR which runs as a query but doesn't produce any results. Most likely because Flink still thinks of a regular join instead of a time-window JOIN and doesn't emit any resutls. (FYI interest, after executing the query, I convert the Table back to a stream via tEnv.toAppendStream and I use Flink 1.8.0 for tests). My questions are now: 1. How do I see if Flink treats my table result as a regular JOIN result or a time-bounded JOIN? 2. What is the proper way to formulate my initial query, finding all matching events within the same tumbling window? Best regards Theo Diefenthal [1] [ https://de.slideshare.net/FlinkForward/flink-forward-berlin-2018-xingcan-cui-stream-join-in-flink-from-discrete-to-continuous-115374183 | https://de.slideshare.net/FlinkForward/flink-forward-berlin-2018-xingcan-cui-stream-join-in-flink-from-discrete-to-continuous-115374183 ] Slide 18 BQ_END BQ_END --=_dee9df68-bfd3-4790-b210-b6f1d0ef49a0 Content-Type: text/html; charset=utf-8 Content-Transfer-Encoding: quoted-printable
Hi Fabian, Hi Zhenghua

Thank you for your suggestions and telling me= that I was on the right track. And good to know how to find out whether so= mething yields to time-bounded or regular join.

@Fabian: Regarding your sugges= ted first option: Isn't that exactly what my first try was? With this TUMBL= E_START... That sadly didn't work due to " Rowtime attr= ibutes must not be in the input rows of a regular join ".= But I'll give option 2 a try by just adding another attribute.

And some additi= on: Regarding my second try: I wrote that the reduced query didn't produce = any data, but that was indeed my mistake. I fiddled around too much with my= data so that I manipulated the original data in a way that the query could= n't output a result any more when testing all of those combinations. Now th= e second attempt works but isn't really what I wanted to query (as the "sam= e day"-predicate is still missing).
Best regards
Theo


Von: "Fabian Hueske" <fhueske@g= mail.com>
An: "Zhenghua Gao" <docete@gmail.com>
CC= : "Theo Diefenthal" <theo.diefenthal@scoop-software.de>, "user" &= lt;user@flink.apache.org>
Gesendet: Freitag, 16. August 2019 1= 0:05:45
Betreff: Re: I'm not able to make a stream-stream Time wi= ndows JOIN in Flink SQL

Hi Theo,

The main problem is = that the semantics of your join (Join all events that happened on the same = day) are not well-supported by Flink yet.

In terms of true st= reaming joins, Flink supports the time-windowed join (with the BETWEEN pred= icate) and the time-versioned table join (which does not apply here).
The first does not really fit because it puts the windows "around = the event", i.e., if you have an event at 12:35 and a window of 10 mins ear= lier and 15 mins later, it will join with events between 12:25 and 12:50.
An other limitation of Flink is that you cannot modify event-time = attributes (well you can, but they lose their event-time property and becom= e regular TIMESTAMP attributes).
This limitation exists, because = we must ensure that the attributes are still aligned with watermarks after = they were modified (or adjusting the watermarks accordingly).
Since analyzing expressions that modify timestamps to figure out whether = they preserve watermark alignment is very difficult, we opted to always rem= ove event-time property when an event-time attribute is modified.

=
I see two options for your use case:

1) use the join tha= t you described before with the -24 and +24 hour window and apply more fine= -grained predicates to filter out the join results that you don't need.
2) add an additional time attribute to your input that is a rounded = down version of the timestamp (rounded to 24h), declare the rounded timesta= mp as your event-time attribute, and join with an equality predicate on the= rounded timestamp.

Best, Fabian

Am Di., 13. Aug. 2019 um 13:41 = Uhr schrieb Zhenghua Gao <docete@gmail.com>:
<= /div>
I wrote a demo example for time windowed join which you can pick up [1]<= br>

<= span style=3D"color:rgb( 79 , 79 , 79 );font-family:'microsoft yahei' , 'sf= pro display' , 'roboto' , 'noto' , 'arial' , 'pingfang sc' , sans-serif;fo= nt-size:12px">Best Regards,
Zhenghu= a Gao


On Tue, Aug 13, 2019 at 4:13 PM Zhenghua= Gao <docete@gmail.com> wrote:
You can che= ck the plan after optimize to verify it's a regular join or time-bounded jo= in(Should have a WindowJoin). The most direct way is breakpoint at optimizi= ng phase [1][2].
On Mon, Aug 12, 2019 at 10:49 PM Theo Diefenthal <theo.diefenthal@scoop-software.de> wrote:
Hi there,

Currently, I'm trying to wr= ite a SQL query which shall executed a time windowed/bounded JOIN on two da= ta streams.

Suppose I have stream1 with attribute id, ts,= user and stream2 with attribute id, ts, userName. I want to receive the na= tural JOIN of both streams with events of the same day.

= In Oracle (With a ts column as number instead of Timestamp, for historical = reasons), I do the following:

<=
span style=3D"color:rgb( 0 , 0 , 128 );font-weight:bold">SELECT *
FROM STREAM1
JOIN STREAM2 ON STREAM1."user" =3D STREAM2."userName"
AND TRUNC(TO_DATE('19700101', 'YYYYMMDD') + ( 1
/ 24 / 60 / 60 / 1000 ) * STREAM1."ts") =3D TRUNC(TO_DATE('19700101', = 'YYYYMMDD'= ) + ( 1 / 24 / 60 / 1000 ) * STREAM2."ts");
wh= ich yields 294 rows with my test data (14 elements from stream1 match to 21= elements in stream2 on the one day of test data). Now I want to query the = same in Flink. So I registered both streams as table and properly registere= d the even-time (by specifying ts.rowtime as table column).

<= div>My goal is to produce a time-windowed JOIN so that, if both streams adv= ance their watermark far enough, an element is written out into an append o= nly stream.

First try (to conform time-bounded-JOIN cond= itions):

Then I found [1], which really is my query but without the= last condition (reduce to matching on the same day). I tried this one as w= ell, just to have a starting point, but the error is the same.
I then reduced the Condition to just one time bound:
SELECT s1.id, s2.id 
FROM STREAM1 AS s1
JOIN STREAM2 AS s2
ON s1.`user`= =3D s2.userName
AND s1.ts BETWEEN s2.ts - INTERVAL '24' HOUR AND s2.ts + INTERVAL '24' HOUR
which runs as a query but doesn'= t produce any results. Most likely because Flink still thinks of a regular = join instead of a time-window JOIN and doesn't emit any resutls. (FYI inter= est, after executing the query, I convert the Table back to a stream via tE= nv.toAppendStream and I use Flink 1.8.0 for tests).

My q= uestions are now:
1. How do I see if Flink treats my table re= sult as a regular JOIN result or a time-bounded JOIN?
2. What= is the proper way to formulate my initial query, finding all matching even= ts within the same tumbling window?

Best regards
Theo Diefenthal

--=_dee9df68-bfd3-4790-b210-b6f1d0ef49a0--