flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From XilangYan <xilang....@gmail.com>
Subject Re: Let BucketingSink roll file on each checkpoint
Date Fri, 29 Jun 2018 01:03:36 GMT
Hi Febian,

Finally I have time to read the code, and it is brilliant it does provide
exactly once guarantee。
However I still suggest to add the function that can close a file when
checkpoint made. I noticed that there is an enhancement
https://issues.apache.org/jira/browse/FLINK-9138 which can close file on a
time based rollover, but it is not very accurate.
My user case is we read data from message queue, write to HDFS, and our ETL
team will use the data in HDFS. In the case, ETL need to know if all data is
ready to be read accurately, so we use a counter to count how many data has
been wrote, if the counter is equal to the number we received, we think HDFS
file is ready. We send the counter message in a custom sink so ETL can know
how many data has been wrote, but if use current BucketingSink, even through
HDFS file is flushed, ETL may still cannot read the data. If we can close
file during checkpoint, then the result is accurately. And for the HDFS
small file problem, it can be controller by use bigger checkpoint interval.

I did take the BuckingSink code and adapt our case, but if it can be done in
Flink, we can save to time to maintain our own branch.


Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

View raw message