flink-user mailing list archives

From Bruno Aranda <bara...@apache.org>
Subject Re: StreamingFileSink seems to be overwriting existing part files
Date Fri, 29 Mar 2019 15:38:27 GMT
Hi Kostas,

Put that way, sounds fair enough. Many thanks for the clarification,

Cheers,

Bruno

On Fri, 29 Mar 2019 at 15:32, Kostas Kloudas <kkloudas@gmail.com> wrote:

> Hi Bruno,
>
> This is the expected behaviour as the job starts "fresh", given that you
> did not specify any savepoint/checkpoint to start from.
>
> As for the note that "One would expect that it finds the last part and
> gets the next free number?",
> I am not sure how this can be achieved safely and efficiently in an
> eventually consistent object store like S3.
> This is actually the reason why, contrary to the BucketingSink, the
> StreamingFileSink relies on Flink's own state to determine the "next" part
> counter.
>
> Cheers,
> Kostas
>
> On Fri, Mar 29, 2019 at 4:22 PM Bruno Aranda <baranda@apache.org> wrote:
>
>> Hi,
>>
>> One of the main reasons we moved to version 1.7 (and 1.7.2 in particular)
>> was because of the possibility of using a StreamingFileSink with S3.
>>
>> We've configured a StreamingFileSink to use a DateTimeBucketAssigner to
>> bucket by day. It's got a parallelism of 1 and is writing to S3 from an EMR
>> cluster in AWS.
>>
>> We ran the job and after a few hours of activity, manually cancelled it
>> through the jobmanager API. After confirming that a number of "part-0-x"
>> files existed in S3 at the expected path, we then started the job again
>> using the same invocation of the CLI "flink run..." command that was
>> originally used to start it.
>>
>> It started writing data to S3 again, starting afresh from "part-0-0",
>> which gradually overwrote the existing data.
>>
>> I can understand that not having used a checkpoint gives no indication of
>> where to resume, but the fact that it overwrites the existing files (as it
>> starts writing to part-0-0 again) is surprising. One would expect that it
>> finds the last part and gets the next free number?
>>
>> We're definitely using the flink-s3-fs-hadoop-1.7.2.jar and don't have
>> the presto version on the classpath.
>>
>> Is this the expected behaviour? We have not seen this with the
>> non-streaming versions of the sink.
>>
>> Best regards,
>>
>> Bruno
>>
>
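For the restart scenario described above, the part counters survive only if the job is resumed from a savepoint, since the StreamingFileSink keeps them in Flink state. With the Flink 1.7 CLI the sequence looks roughly like the following fragment; the job ID, bucket, and jar name are placeholders:

```
# Stop the job while taking a savepoint (job ID and target path are placeholders)
flink cancel -s s3://my-bucket/savepoints <job-id>

# Resume from that savepoint so the sink restores its part counters
flink run -s s3://my-bucket/savepoints/<savepoint-dir> my-job.jar
```

Starting with a plain "flink run ..." (no -s), as in the report above, gives the job no state to restore, so the counters begin again at part-0-0.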

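The trade-off Kostas mentions can be made concrete with a small sketch (plain Python, not Flink code; all names are illustrative). A sink that derived its next part number from a listing of an eventually consistent store could pick a counter that collides with a durable but not-yet-listable object:

```python
# Toy model, not Flink or S3 code: why "list the bucket and take
# max part + 1" is unsafe on an eventually consistent store.

class EventuallyConsistentStore:
    """Listings lag behind writes, as S3 listings could at the time."""

    def __init__(self):
        self.objects = {}     # the real, durable state
        self.visible = set()  # what a LIST call happens to return

    def put(self, key, data):
        self.objects[key] = data
        # Deterministic stand-in for eventual consistency: every key
        # except the most recent write is already listable.
        self.visible = set(self.objects) - {key}

    def list_keys(self):
        return sorted(self.visible)


store = EventuallyConsistentStore()
for i in range(5):
    store.put(f"part-0-{i}", f"data-{i}")  # durable keys part-0-0 .. part-0-4

# A restarted job that trusts the listing picks "max part + 1".
listed = store.list_keys()  # stale: part-0-4 is durable but not listed
next_counter = 1 + max(int(k.rsplit("-", 1)[1]) for k in listed)

print(next_counter)                               # 4
print(f"part-0-{next_counter}" in store.objects)  # True: would overwrite
```

Keeping the counter in the job's own checkpointed state, as the StreamingFileSink does, avoids ever having to trust a listing.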