flink-user mailing list archives

From Tony Wei <tony19920...@gmail.com>
Subject checkpoint failed due to s3 exception: request timeout
Date Wed, 29 Aug 2018 03:35:51 GMT
Hi,

I ran into a checkpoint failure caused by an S3 exception.

> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> Your socket connection to the server was not read from or written to within
> the timeout period. Idle connections will be closed. (Service: Amazon S3;
> Status Code: 400; Error Code: RequestTimeout; Request ID:
> B8BE8978D3EFF3F5), S3 Extended Request ID:
> ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=


The full stack trace and a screenshot are provided in the attachment.

My settings for the Flink cluster and job (a rough sketch of the job-side
setup follows the list):

   - Flink version 1.4.0
   - standalone mode
   - 4 slots per TM
   - Presto S3 filesystem
   - RocksDB state backend
   - local SSD
   - incremental checkpoints enabled
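
For reference, the state backend is configured in the job roughly like this
(the checkpoint interval and bucket path below are placeholders, not my real
values):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
            // checkpoint every 60 seconds (placeholder interval)
            env.enableCheckpointing(60_000L);
            // the second argument "true" turns on incremental checkpoints
            env.setStateBackend(
                new RocksDBStateBackend("s3://my-bucket/checkpoints", true));
            // ... build the job graph here ...
            env.execute("my-job");
        }
    }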

There is no unusual message besides the exception in the log file, and no
high GC ratio during the checkpoint procedure. Moreover, 3 of the 4 parts on
that TM were still uploaded successfully. I couldn't find anything that would
relate to this failure. Has anyone run into this problem before?

Besides, I found an issue in another AWS SDK [1] that mentions this S3
exception as well. One reply said the problem can be worked around passively
by raising the max client retries config, and I found the corresponding
config in Presto [2]. Can I just add s3.max-client-retries: xxx to
flink-conf.yaml to set it? If not, how should I override the default value of
this configuration?
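
To make the question concrete, this is the kind of entry I have in mind. I am
assuming here that flink-s3-fs-presto forwards keys under the s3. prefix to
presto.s3.*, and the value 10 is just an example:

    # flink-conf.yaml
    # assumption: s3.* keys are forwarded to presto.s3.* by flink-s3-fs-presto
    s3.max-client-retries: 10

Thanks in advance.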

Best,
Tony Wei

[1] https://github.com/aws/aws-sdk-php/issues/885
[2] https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/HiveS3Config.java#L218
