flink-user mailing list archives

From Xingcan Cui <xingc...@gmail.com>
Subject Re: [VOTE] How to Deal with Split/Select in DataStream API
Date Mon, 08 Jul 2019 04:39:27 GMT
Hi all,

Thanks for your participation.

In this thread, we got one +1 each for option 1 and option 3. In the original thread[1],
we got two +1s for option 1, one +1 for option 2, and five +1s and one -1 for option 3.

To summarize,

Option 1 (port side output to flatMap and deprecate split/select): three +1
Option 2 (introduce a new split/select and deprecate existing one): one +1
Option 3 ("correct" the existing split/select): six +1 and one -1

It seems that most people involved are in favor of "correcting" the existing split/select.
However, this will definitely break the API compatibility, in a subtle way.

IMO, the real behavior of consecutive split/select calls has never been thoroughly clarified.
Even within the community, it is hard to say that we have reached a consensus on its real semantics[2-4].
Though the initial design is not ambiguous, there's no doubt that its concept has drifted.

As split/select is quite an old API, I have cc'ed this to more members. It would be
great if you could share your opinions on this.


[1] https://lists.apache.org/thread.html/f94ea5c97f96c705527dcc809b0e2b69e87a4c5d400cb7c61859e1f4@%3Cdev.flink.apache.org%3E
[2] https://issues.apache.org/jira/browse/FLINK-1772
[3] https://issues.apache.org/jira/browse/FLINK-5031
[4] https://issues.apache.org/jira/browse/FLINK-11084

> On Jul 5, 2019, at 12:04 AM, 杨力 <bill.lee.y@gmail.com> wrote:
> I prefer approach 1). I used to carry fields, which are needed only for splitting,
in the outputs of flatMap functions. Replacing them with OutputTags would simplify the data structures.
> Xingcan Cui <xingcanc@gmail.com> wrote on Fri, Jul 5, 2019:
> Hi folks,
> Two weeks ago, I started a thread [1] discussing whether we should discard the split/select
methods (which have been marked as deprecated since v1.7) in the DataStream API.
> The fact is, these methods produce "unexpected" results when used consecutively (e.g.,
ds.split(a).select(b).split(c).select(d)) or multiple times on the same target (e.g., ds.split(a).select(b),
ds.split(c).select(d)). The reason is that, following the initial design, new split/select
logic always overrides the existing logic on the same target operator rather than being appended
to it. Some users may not be aware of this; for those who are, the current workaround is to
use the more powerful side output feature [2].
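The difference between "override" and "append" can be illustrated with a small plain-Java model. This is only a sketch under simplifying assumptions: it does not use Flink, and the names SplitSelectSketch, Step, runOverride, and runAppend are hypothetical. Each split/select pair is modeled as a tagging function plus the tag that is selected.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Simplified model of consecutive split/select; NOT Flink's implementation.
public class SplitSelectSketch {

    /** One split(selector).select(name) pair (hypothetical helper type). */
    static final class Step {
        final Function<Integer, String> selector; // tags each element
        final String selected;                    // tag kept by select(...)

        Step(Function<Integer, String> selector, String selected) {
            this.selector = selector;
            this.selected = selected;
        }

        boolean keeps(int value) {
            return selected.equals(selector.apply(value));
        }
    }

    /** "Override" reading: only the most recently registered step takes effect. */
    static List<Integer> runOverride(List<Integer> input, List<Step> steps) {
        Step last = steps.get(steps.size() - 1);
        return input.stream().filter(last::keeps).collect(Collectors.toList());
    }

    /** "Append" reading: each step filters the output of the previous one. */
    static List<Integer> runAppend(List<Integer> input, List<Step> steps) {
        return input.stream()
                .filter(v -> steps.stream().allMatch(s -> s.keeps(v)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
        // Models ds.split(a).select("even").split(c).select("large")
        Step evens = new Step(v -> v % 2 == 0 ? "even" : "odd", "even");
        Step large = new Step(v -> v > 4 ? "large" : "small", "large");
        List<Step> steps = Arrays.asList(evens, large);

        System.out.println(runOverride(input, steps)); // [5, 6, 7, 8]
        System.out.println(runAppend(input, steps));   // [6, 8]
    }
}
```

Under the override reading the first split/select is silently lost, which is the "unexpected" result; under the append reading the two steps compose as most users would expect.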
> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> added some restrictions
to the existing split/select logic and suggested replacing it with side output in the future.
However, considering that side output is currently only available in the process function
layer and that split/select may have been widely used in many real-world applications, we'd
like to start a vote and listen to the community on how to deal with them.
> In the discussion thread [1], we proposed three solutions as follows. All of them are
feasible but have different impacts on the public API.
> 1) Port the side output feature to DataStream API's flatMap and replace split/select
with it.
> 2) Introduce a dedicated function in DataStream API (with the "correct" behavior but
a different name) that can be used to replace the existing split/select.
> 3) Keep split/select but change the behavior/semantics to be "correct".
> Note that this is just a vote for gathering information, so feel free to participate
and share your opinions.
> The voting time will end on July 7th 17:00 EDT.
> Thanks,
> Xingcan
> [1] https://lists.apache.org/thread.html/f94ea5c97f96c705527dcc809b0e2b69e87a4c5d400cb7c61859e1f4@%3Cdev.flink.apache.org%3E
> [2] https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html
