flink-user mailing list archives

From vino yang <yanghua1...@gmail.com>
Subject Re: Why checkpoint took so long
Date Fri, 17 Aug 2018 12:23:41 GMT
Hi Alex,

Have you checked whether Flink hit a timeout when accessing the file
system?

Can you share the JM log and the checkpoint-specific log from the TM?

Thanks, vino.

Alex Vinnik <alvinnik.g@gmail.com> wrote on Fri, Aug 17, 2018 at 11:51 AM:

> Vino,
>
> 1. No custom implementations for the source or checkpoints. The source is
> JSON files on S3.
>
> JsonLinesInputFormat format = new JsonLinesInputFormat(new Path(customerPath), configuration);
> format.setFilesFilter(FilePathFilter.createDefaultFilter());
> // Read JSON objects from the given path, monitoring it continuously for updates
> env
>    .readFile(format, customerPath, FileProcessingMode.PROCESS_CONTINUOUSLY, pollInterval.toMillis())
>
>
> RocksDB is used as the state backend.
>
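> For context, the state backend and checkpointing are wired up roughly like
> this (a minimal sketch; the checkpoint interval shown here is illustrative,
> not the exact value from our job):
>
> import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
> StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
> // RocksDB keeps state locally on the TMs and writes checkpoints to the S3 path below
> env.setStateBackend(new RocksDBStateBackend("s3://curation-two-admin/flink/sa-checkpoint/sa1"));
> // e.g. trigger a checkpoint every 60 seconds (illustrative interval)
> env.enableCheckpointing(60_000);
>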
> 2. The majority of checkpoints time out after 15 minutes.
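>
> In case the exact settings matter, the timeout is set through CheckpointConfig,
> roughly like this (the values are from memory, so treat them as approximate):
>
> CheckpointConfig checkpointConfig = env.getCheckpointConfig();
> // checkpoints that run longer than 15 minutes are declared failed
> checkpointConfig.setCheckpointTimeout(15 * 60 * 1000);
> // never run two checkpoints at the same time
> checkpointConfig.setMaxConcurrentCheckpoints(1);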
>
> Thanks
>
> On Thu, Aug 16, 2018 at 8:48 PM vino yang <yanghua1127@gmail.com> wrote:
>
>> Hi Alex,
>>
>> I still have a few questions:
>>
>> 1) Are the file source and the checkpoint logic implemented by you?
>> 2) For the other failed checkpoints, can you share the corresponding failure
>> logs or more details, e.g. whether they failed due to a timeout or for other
>> reasons?
>>
>> Thanks, vino.
>>
>> Alex Vinnik <alvinnik.g@gmail.com> wrote on Fri, Aug 17, 2018 at 3:03 AM:
>>
>>> I noticed something strange in Flink 1.3 checkpointing. A checkpoint
>>> succeeded, but it took 15 minutes 53 seconds. The size of the checkpoint
>>> metadata on S3 is just 1.7 MB. Most of the time, checkpoints actually fail.
>>>
>>> aws --profile cure s3 ls --recursive --summarize --human
>>> s3://curation-two-admin/flink/sa-checkpoint/sa1/checkpoint_metadata-c99cfda10951
>>> 2018-08-16 13:34:07    1.7 MiB
>>> flink/sa-checkpoint/sa1/checkpoint_metadata-c99cfda10951
>>>
>>> I came across this discussion:
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoints-very-slow-with-high-backpressure-td12762.html#a19370.
>>> But there the problem seems to have been caused by high backpressure, which
>>> is not the case for me.
>>>
>>> taskmanager.network.memory.max is set to 128 MB, which is very small; I was
>>> hoping to get faster checkpoints with smaller network buffers, since I am
>>> reading from durable storage (S3) and don't need to worry about buffering
>>> reads because of slow writes.
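>>>
>>> For reference, the relevant flink-conf.yaml entries (the fraction and min
>>> values below are guesses at fairly typical settings; only the 128 MB max is
>>> what I actually set):
>>>
>>> # network buffer pool: fraction of TM memory, clamped to [min, max] bytes
>>> taskmanager.network.memory.fraction: 0.1
>>> taskmanager.network.memory.min: 67108864
>>> taskmanager.network.memory.max: 134217728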
>>>
>>> Any ideas what could cause such slow checkpointing? Thanks. -Alex
>>>
>>> [image: Screen Shot 2018-08-16 at 1.43.23 PM.png]
>>>
>>
