flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Lam <paullin3...@gmail.com>
Subject Re: StreamingFileSink duplicate data
Date Thu, 21 Nov 2019 08:33:48 GMT

StreamingFileSink would not remove committed files, so if you use a non-latest checkpoint
to restore state, you may need to perform a manual cleanup.

WRT the part id issue, StreamingFileSink will track the global max part number, and use this
value + 1 as the new id upon restoring. In this way, we avoid file name conflicts with the
previous execution (see[1]).

[1] https://github.com/apache/flink/blob/93dfdd05a84f933473c7b22437e12c03239f9462/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/Buckets.java#L276

Paul Lam

> 在 2019年11月21日,10:01,Lei Nie <lyzjn81@gmail.com> 写道:
> Hello,
> I would like clarification on the StreamingFileSink, thank you.
> From my testing, it seems that resuming job from checkpoint does not also restore the
rolling part counter.
> E.g, job may have stopped with last file:
> part-6-71
> But when resuming from most recent checkpoint:
> part-6-89
> (There is unexplained gap).
> This is a problem if I am having an issue with my job, and need to roll back more than
one checkpoint. After rolling back to the 4th last checkpoint, e.g, the data will be written
into different part file names, causing duplication.
> -----------------------------------------------------------------
> For example, checkpoints:
> chk-17, chk-18, chk-19, chk-20
> Original data:
> part-1-5, part-1-6, part-1-7
> Rollback to chk-17, which writes part-1-18, but with the same data as part-1-5! This
is duplicate.
> ------------------------------------------------------------------
> Am I correct? How to avoid this?

View raw message