flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jark Wu" <wuchong...@alibaba-inc.com>
Subject Re: How to contribute to Streaming Table API and StreamSQL
Date Fri, 17 Jun 2016 07:48:51 GMT
Hi Fabian,

There are a lot of our business are using non-windowed aggregations. And there is a little
difference between non-windowed aggregate and Row window operator, as the later is bound to
a certain window and emit the result of the N rows preceding for every incoming row. However
the former emit the aggregate result of the whole elements. So I suggest to add them for more
complete semantic.

Regarding the windowed aggregate task, I’m agree with that and I'm looking forward as soon
as possible to see the corresponding JIRA issues created. After that, we can start working
on an independent branch without waiting for 1.1 released. But I’m still a little concerned
about Calcite’s support, as we must waiting for Calcite supporting correspond syntax and
the  version released. If we can separate the task into Table API and SQL  , we may not be
blocked by Calcite too much.

What do you think? 

- Jark Wu 

> 在 2016年6月16日,下午8:37,Fabian Hueske <fhueske@gmail.com> 写道:
> 
> Hi Jark,
> 
> thanks for sharing Blink's Streaming Table API. It seems to be close to the
> DataStream API, while the Table API draft I shared is more similar to
> Calcite's proposal.
> You are right, the current draft does not include running (non-windowed)
> aggregates. We were not sure how useful these are since these aggregates
> are unbound and might become meaningless after being applied on a very long
> stream. However, we can certainly add them, if users request them.
> An alternative to running aggregates could be what I called "Row window
> operators" in the streaming Table API draft. These operators emit an
> aggregate for each incoming row, however the aggregate is bound to a
> certain window around the row like the 10 rows preceding the row for which
> the aggregate is computed. Calcite calls these windows "Sliding windows"
> (Attention: This is different from Flink's terminology, in Flink sliding
> windows are something different). Row windows are similar to running
> aggregates in that they emit a row for each incoming row. You can also
> think of them as a (Flink) sliding count window which is evaluate for each
> incoming record.
> 
> Further differences are the support of Scalar UDFs in the Table API and the
> support for joins which have not been drafted for the Table API yet.
> Scalar UDFs are definitely also on our roadmap and with upcoming support
> for side inputs, the DataStream API will also support more types of joins.
> 
> Regarding the current state of Stream SQL in Calcite I am not up to date.
> 
> I would propose to start with the effort of adding support for windowed
> aggregates as follows:
> 
> 1) Add support to define a timestamp / watermark extractor to tables. This
> includes to define a "quasi-monotone" column in a Table's schema. Calcite
> will use this information to reason about the validity of a query (making
> sure that grouping includes at least one monotone attribute).
> 2) Add support for sorting a stream on the timestamp attribute. While
> sorting itself is not very exciting, it is an easy operation and can be
> immediately implemented without worrying about API questions. This will
> also show how well Calcite supports the reasoning about monotone attributes.
> 3) Add support for tumbling windows.
> 
> In each of these steps we might need to get involved with the Calcite
> community, depending on Calcite's current support for "quasi-monotone"
> attributes, etc.
> 
> What do you think?
> 
> Best, Fabian
> 
> 
> 2016-06-14 11:03 GMT+02:00 Jark Wu <wuchong.wc@alibaba-inc.com>:
> 
>> Hi Fabian,
>> 
>> It’s great to hear that we are going to start it!
>> 
>> I’m glad to share our current Streaming Table API [1]. I find that that
>> all aggregation functions are scoped to the defined window in Flink Stream
>> Table API design [2] and Calcite StreamSQL desgin [3]. I’m thinking that do
>> we need global aggregation? The global aggregation means that aggregation
>> is applied only on grouped key not including window which is supported in
>> DataStream `datastream.keyBy(f1).sum(f2)`.
>> 
>> Since the window syntax of StreamSQL is not implemented yet, so will we
>> help Calcite community with that first or work code for window+agg Table
>> API first ?
>> 
>> 
>> [1]
>> https://docs.google.com/document/d/1KMUzvBAWSyQ39T8MyxUi0zNHyvLUnyGMPA7_RLSDpFw/edit?usp=sharing
>> <
>> https://docs.google.com/document/d/1KMUzvBAWSyQ39T8MyxUi0zNHyvLUnyGMPA7_RLSDpFw/edit?usp=sharing
>>> 
>> [2]
>> https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E/edit#
>> <
>> https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E/edit#
>>> 
>> [3] https://calcite.apache.org/docs/stream.html#tumbling-windows <
>> https://calcite.apache.org/docs/stream.html#tumbling-windows>
>> 
>> 
>> - Jark Wu
>> 
>>> 在 2016年6月14日,上午1:10,Fabian Hueske <fhueske@gmail.com> 写道:
>>> 
>>> Hi Jark,
>>> 
>>> wow, that's good news!
>>> You are right, the streaming Table API is currently very limited. In the
>>> last month's we changed the internal architecture and put everything on
>> top
>>> of Apache Calcite.
>>> For the upcoming 1.1 release, we won't add new features to the Table API
>> /
>>> SQL. However for the 1.2 release, it we plan to focus on the streaming
>>> Table API and Stream SQL to add support for windowed aggregates and
>> joins,
>>> which corresponds to Task 7 and 9 in the design document. You are
>>> completely right, that we should start to add tickets to JIRA for this. I
>>> will do that tomorrow.
>>> 
>>> It is great that you have already working code for windowed aggregates
>> and
>>> joins! Here is a link to our current API draft for windows in the Table
>> API
>>> [1]. Would be great if you could share how your API looks like. As you
>>> said, the internals have changed a lot by now, but we might want to reuse
>>> your API for Table API windows and maybe the code of the runtime.
>> However,
>>> we need to go through Calcite for optimization and SQL support, so some
>>> parts need to be definitely changed. Stream SQL is also on the roadmap of
>>> the Calcite community, but it might be that some features that we will
>> need
>>> are not completed yet. So, maybe we help the Calcite community with that
>> as
>>> well.
>>> 
>>> If you want to contribute, you should first read the how to contribute
>>> guide [2] and guide for code contributions [3].
>>> The general rule is to first open a JIRA and later a pull request.
>> Changes
>>> that are extensive or modify current behavior (except bugs) should be
>>> discussed before starting to work on them.
>>> 
>>> Looking forward to work with you on Flink,
>>> Fabian
>>> 
>>> [1]
>>> 
>> https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E/edit#heading=h.3iw7frfjdcb2
>>> [2] http://flink.apache.org/how-to-contribute.html
>>> [3] http://flink.apache.org/contribute-code.html
>>> 
>>> 
>>> 2016-06-13 11:31 GMT+02:00 Jark Wu <wuchong.wc@alibaba-inc.com>:
>>> 
>>>> Hi,
>>>> 
>>>> We have a great interest in the new Table API & SQL. In Alibaba, we have
>>>> added a lot of features (groupBy, agg, window, join, UDF …) to Streaming
>>>> Table API (base on Flink 1.0). Now, many jobs run on Table API in
>>>> production environment. But we want to keep pace with the community,
>> and we
>>>> have noticed that Flink Community reworked the Table API and also
>> supported
>>>> SQL. That is really cool. However, the Streaming Table API is still so
>>>> weak. So we want to contribute to accelerate the Streaming Table API and
>>>> StreamSQL growth.
>>>> 
>>>> It seems that we have complete Task-5 and Task-6 mentioned in the Work
>>>> Plan <
>>>> 
>> https://docs.google.com/document/d/1TLayJNOTBle_-m1rQfgA6Ouj1oYsfqRjPcp1h2TVqdI/edit#
>>> .
>>>> So can we start Task-7 and Task-9 now? Is there any more specific
>> plans? I
>>>> think it’s better to create an umbrella JIRA like FLINK-3221 to make the
>>>> develop plan clearer.
>>>> 
>>>> If I want to contribute code for groupBy and agg function, what should I
>>>> do? As I didn’t find related JIRAs, can I create JIRA and pull a request
>>>> directly?
>>>> 
>>>> Sorry for so many questions at a time.
>>>> 
>>>> 
>>>> 
>>>> - Jark Wu (wuchong)
>>>> 
>>>> 
>> 
>> 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message