flink-user mailing list archives

From "Yun Gao" <yungao...@aliyun.com>
Subject Re: Re: Writing _SUCCESS Files (Streaming and Batch)
Date Tue, 12 May 2020 08:36:59 GMT
Hi Peter,

    Sorry for missing the question and responding late. I'm currently working together with
Jingsong on supporting "global committing" (such as writing a _SUCCESS file or adding
partitions to the Hive metastore) after buckets are terminated. In 1.11 we may first support
watermark/time-related buckets in the Table/SQL API, and we are also thinking of supporting
"global committing" for arbitrary bucket assigner policies for StreamingFileSink users. The
current rough idea is to let users specify when a bucket is terminated on a single task. The
OperatorCoordinator[1] of the sink will then aggregate the information about this bucket from
all subtasks and do the global commit once the bucket has been finished on all of them. This
is still under consideration and discussion, so any thoughts or requirements on this issue are
warmly welcome.

Best,
 Yun


[1] OperatorCoordinator is introduced in FLIP-27: https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface
It is a component that resides in the JobManager and can communicate with all the subtasks of
the corresponding operator, so it can be used to aggregate status from the subtasks.
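
To make the rough idea above concrete, here is a minimal sketch of the aggregation step in plain Java. It is not Flink code and does not use the OperatorCoordinator interface; the class name BucketCommitAggregator, the method onBucketFinished, and the idea of identifying buckets by a string id are all assumptions made for illustration. It only shows the bookkeeping: remember which subtasks have reported a bucket as terminated, and signal a global commit once every subtask has reported.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical coordinator-side bookkeeping; a sketch, not a Flink API. */
class BucketCommitAggregator {
    private final int parallelism;
    // bucket id -> indices of subtasks that reported the bucket as terminated
    private final Map<String, Set<Integer>> finishedSubtasks = new HashMap<>();
    // buckets for which the global commit has already been signalled
    private final Set<String> committed = new HashSet<>();

    BucketCommitAggregator(int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        this.parallelism = parallelism;
    }

    /**
     * Records that one subtask has finished a bucket.
     * Returns true exactly once per bucket: when the last missing subtask reports.
     */
    boolean onBucketFinished(String bucketId, int subtaskIndex) {
        if (subtaskIndex < 0 || subtaskIndex >= parallelism) {
            throw new IllegalArgumentException("unknown subtask " + subtaskIndex);
        }
        if (committed.contains(bucketId)) {
            return false; // already committed, ignore late or duplicate reports
        }
        Set<Integer> reported =
                finishedSubtasks.computeIfAbsent(bucketId, k -> new HashSet<>());
        reported.add(subtaskIndex); // duplicates from the same subtask are harmless
        if (reported.size() < parallelism) {
            return false; // still waiting for other subtasks
        }
        finishedSubtasks.remove(bucketId);
        committed.add(bucketId);
        return true; // caller would now write _SUCCESS or add the partition
    }
}
```

A caller would invoke onBucketFinished for each report and perform the global commit only when it returns true. A real implementation would also need to handle checkpointing and failover of this state, which the sketch leaves out.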


 ------------------ Original Message ------------------
From: Robert Metzger <rmetzger@apache.org>
Sent: Tue May 12 15:36:26 2020
To: Jingsong Li <jingsonglee0@gmail.com>
Cc: Peter Groesbeck <peter.groesbeck@gmail.com>, user <user@flink.apache.org>
Subject: Re: Writing _SUCCESS Files (Streaming and Batch)

Hi Peter,

I filed a ticket for this feature request: https://issues.apache.org/jira/browse/FLINK-17627
(feel free to add your thoughts / requirements to the ticket)

Best,
Robert


On Wed, May 6, 2020 at 3:41 AM Jingsong Li <jingsonglee0@gmail.com> wrote:

Hi Peter,

The troublesome part is how to know the "end" of a bucket in a streaming job.
In 1.11, we are trying to implement a watermark-related bucket ending mechanism[1] in Table/SQL.

[1]https://cwiki.apache.org/confluence/display/FLINK/FLIP-115%3A+Filesystem+connector+in+Table
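
As an illustration of what a watermark-related bucket ending could look like, here is a small Java sketch. It is not the FLIP-115 implementation; the class WatermarkBucketEnd and its method are hypothetical, and it assumes time-based buckets covering the half-open interval [start, start + size) in event time. It relies only on the usual watermark meaning: a watermark of t says that no more events with a timestamp of t or earlier are expected.

```java
import java.time.Duration;

/** Hypothetical helper; a sketch of the idea, not Flink's implementation. */
class WatermarkBucketEnd {
    /**
     * A bucket covering [bucketStartMillis, bucketStartMillis + size) holds
     * timestamps up to end - 1, so it can be treated as finished once the
     * watermark has reached end - 1.
     */
    static boolean isBucketFinished(
            long bucketStartMillis, Duration bucketSize, long watermarkMillis) {
        long lastTimestampInBucket = bucketStartMillis + bucketSize.toMillis() - 1;
        return watermarkMillis >= lastTimestampInBucket;
    }
}
```

On a single subtask this check answers "is the bucket finished here?". With several parallel subtasks, each one has its own watermark, so the results still have to be combined before anything global is committed.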

Best,
Jingsong Lee
On Tue, May 5, 2020 at 7:40 AM Peter Groesbeck <peter.groesbeck@gmail.com> wrote:

I am replacing an M/R job with a Streaming job using the StreamingFileSink and there is a
requirement to generate an empty _SUCCESS file like the old Hadoop job. I have to implement
a similar Batch job to read from backup files in case of outages or downtime.
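
For reference, the marker itself is trivial to produce; the hard part discussed in this thread is knowing when to write it. The following Java sketch creates an empty _SUCCESS file in an output directory using only the JDK. The class name SuccessMarker is hypothetical, and the sketch assumes the output directory is reachable through java.nio.file (for example a local or mounted path); HDFS or S3 output would need the corresponding file system client instead.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that writes an empty _SUCCESS marker file. */
class SuccessMarker {
    static Path writeSuccessFile(Path outputDir) throws IOException {
        Files.createDirectories(outputDir); // no-op if the directory exists
        Path marker = outputDir.resolve("_SUCCESS");
        try {
            Files.createFile(marker); // creates a zero-byte file
        } catch (FileAlreadyExistsException e) {
            // The directory was already marked; treat the call as idempotent.
        }
        return marker;
    }
}
```

Such a helper would be called once, after all output for the directory is known to be complete.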

The Batch job question was answered here, and the answer appears to be still relevant, although
it would be great if someone could confirm that for me.
https://stackoverflow.com/a/39413810

The question of the Streaming job came up back in 2018 here:
http://mail-archives.apache.org/mod_mbox/flink-user/201802.mbox/%3CFF74EED5-602F-4EAA-9BC1-6CDF56611267@gmail.com%3E

But the solution to use or extend the BucketingSink class seems out of date now that BucketingSink
has been deprecated.

Is there a way to implement a similar solution for StreamingFileSink?

I'm currently on 1.8.1 although I hope to update to 1.10 in the near future.

Thank you,
Peter

-- 
Best, Jingsong Lee