flink-user mailing list archives

From Lu Niu <qqib...@gmail.com>
Subject Re: Debug Slowness in Async Checkpointing
Date Fri, 24 Apr 2020 04:37:45 GMT
Hi, Robert

Thanks for replying. Yes, after I added monitoring on the above path, it
showed that the slowness did come from uploading files to S3. I am still
investigating the issue. In the meantime, I am trying PrestoS3FileSystem
to check whether it mitigates the problem.
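
For anyone following the thread, monitoring of this kind can be sketched as
a small timing wrapper around the upload call. This is only an illustration
under stated assumptions, not the code used here: UploadTimer, timed and the
step name "uploadSstFile" are hypothetical, no Flink or S3 API is involved,
and the sleep stands in for the real upload.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical helper: accumulates wall-clock time spent in named steps. */
public class UploadTimer {
    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    /** Runs the step and records its duration, even if the step throws. */
    public <T> T timed(String step, Callable<T> body) throws Exception {
        long start = System.nanoTime();
        try {
            return body.call();
        } finally {
            long elapsed = System.nanoTime() - start;
            totalNanos.computeIfAbsent(step, k -> new LongAdder()).add(elapsed);
            counts.computeIfAbsent(step, k -> new LongAdder()).increment();
        }
    }

    /** Total time recorded for a step, in milliseconds (0 if never run). */
    public long totalMillis(String step) {
        LongAdder adder = totalNanos.get(step);
        return adder == null ? 0L : adder.sum() / 1_000_000L;
    }

    /** Number of times a step was recorded (0 if never run). */
    public long count(String step) {
        LongAdder adder = counts.get(step);
        return adder == null ? 0L : adder.sum();
    }

    public static void main(String[] args) throws Exception {
        UploadTimer timer = new UploadTimer();
        // Assumption: the sleep is a stand-in for uploading one SST file.
        String handle = timer.timed("uploadSstFile", () -> {
            Thread.sleep(20);
            return "state-handle";
        });
        System.out.println(handle + ": " + timer.count("uploadSstFile")
                + " upload(s), " + timer.totalMillis("uploadSstFile") + " ms total");
    }
}
```

Logging or exporting these totals per checkpoint would show whether the
upload step dominates the async duration.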

Best
Lu

On Thu, Apr 23, 2020 at 8:10 AM Robert Metzger <rmetzger@apache.org> wrote:

> Hi Lu,
>
> Were you able to resolve the issue with the slow async checkpoints?
>
> I've added Yu Li to this thread. He has more experience with the state
> backends to decide which monitoring is appropriate for such situations.
>
> Best,
> Robert
>
>
> On Tue, Apr 21, 2020 at 10:50 PM Lu Niu <qqibrow@gmail.com> wrote:
>
>> Hi, Robert
>>
>> Thanks for replying. To improve observability, do you think we should
>> expose more checkpointing metrics? For example, for incremental
>> checkpoints, the time spent uploading SST files?
>> https://github.com/apache/flink/blob/5b71c7f2fe36c760924848295a8090898cb10f15/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L319
>>
>> Best
>> Lu
>>
>>
>> On Fri, Apr 17, 2020 at 11:31 AM Robert Metzger <rmetzger@apache.org>
>> wrote:
>>
>>> Hi,
>>> Did you check the TaskManager logs for retries by the s3a file
>>> system during checkpointing?
>>>
>>> I'm not aware of any metrics in Flink that could be helpful in this
>>> situation.
>>>
>>> Best,
>>> Robert
>>>
>>> On Tue, Apr 14, 2020 at 12:02 AM Lu Niu <qqibrow@gmail.com> wrote:
>>>
>>>> Hi, Flink users
>>>>
>>>> We have noticed that async checkpointing can sometimes be extremely slow,
>>>> leading to checkpoint timeouts. For example, for a state size of around
>>>> 2.5 MB, the async part can take 7 to 12 minutes:
>>>>
>>>> [image: Screen Shot 2020-04-09 at 5.04.30 PM.png]
>>>>
>>>> Note that all the slowness comes from the async part of the checkpoint;
>>>> there is no delay in the sync part or in barrier alignment. As we use
>>>> RocksDB incremental checkpointing, I suspect the slowness is caused by
>>>> uploading files to S3. However, I am not completely sure, since there are
>>>> other steps in async checkpointing. Does Flink expose fine-grained metrics
>>>> to debug such slowness?
>>>>
>>>> Setup: Flink 1.9.1, RocksDB incremental state backend,
>>>> S3AHadoopFileSystem
>>>>
>>>> Best
>>>> Lu
>>>>
>>>
