flink-user mailing list archives

From Tony Wei <tony19920...@gmail.com>
Subject checkpoint failed due to s3 exception: request timeout
Date Wed, 29 Aug 2018 03:35:51 GMT

I ran into a checkpoint failure caused by an S3 exception.

> Your socket connection to the server was not read from or written to within
> the timeout period. Idle connections will be closed. (Service: Amazon S3;
> Status Code: 400; Error Code: RequestTimeout; Request ID:
> B8BE8978D3EFF3F5), S3 Extended Request ID:
> ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=

The full stack trace and a screenshot are provided in the attachment.

My setting for flink cluster and job:

   - flink version 1.4.0
   - standalone mode
   - 4 slots for each TM
   - presto s3 filesystem
   - rocksdb statebackend
   - local ssd
   - enable incremental checkpoint
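For reference, a minimal sketch of the relevant flink-conf.yaml entries for this setup (bucket name and local paths here are hypothetical placeholders, not my actual values):

```yaml
# flink-conf.yaml — sketch of the setup above; bucket and paths are hypothetical
taskmanager.numberOfTaskSlots: 4
state.backend: rocksdb
state.backend.fs.checkpointdir: s3://my-bucket/flink-checkpoints
state.backend.rocksdb.localdir: /mnt/ssd/rocksdb
```

Incremental checkpointing is enabled in the job code, since in Flink 1.4 it is turned on via the RocksDBStateBackend constructor rather than a config key.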

There is no unusual message besides the exception in the log file, and no high GC ratio
during the checkpoint procedure. Moreover, 3 of the 4 parts still uploaded successfully on
that TM. I couldn't find anything else that would be related to this failure. Has anyone
met this problem before?

Besides, I also found an issue in another AWS SDK [1] that mentions this same S3
exception. One reply said you can passively avoid the problem by raising the max client
retries config, and I found that config in Presto [2]. Can I just add
s3.max-client-retries: xxx to flink-conf.yaml to configure it? If not, how should I
overwrite the default value of this configuration? Thanks in advance.
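My understanding from the Flink documentation on the shaded Presto S3 filesystem is that keys prefixed with s3. in flink-conf.yaml are forwarded to the Presto filesystem, mirroring Presto's hive.s3.* options. If that is right, a sketch of what I plan to try would look like this (the retry count is only an example value, not a recommendation):

```yaml
# flink-conf.yaml — assumes flink-s3-fs-presto forwards s3.* keys
# to Presto's hive.s3.* options; the value 10 is only an example
s3.max-client-retries: 10
```

Please correct me if the key is not forwarded this way.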

Tony Wei

[1] https://github.com/aws/aws-sdk-php/issues/885
