flink-user mailing list archives

From "Foster, Craig" <foscr...@amazon.com>
Subject Re: S3 checkpointing in AWS in Frankfurt
Date Wed, 23 Nov 2016 17:00:38 GMT
I would suggest using EMRFS anyway, which is the way to access the S3 file system from EMR
(using the same s3:// prefixes).  That said, you will run into the same shading issues in
our build until the next release—which is coming up relatively shortly.

From: Robert Metzger <rmetzger@apache.org>
Reply-To: "user@flink.apache.org" <user@flink.apache.org>
Date: Wednesday, November 23, 2016 at 8:24 AM
To: "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: S3 checkpointing in AWS in Frankfurt

Hi Jonathan,

Have you tried using Amazon's latest EMR Hadoop distribution? Maybe they've fixed the issue
in their distribution for older Hadoop releases?

On Wed, Nov 23, 2016 at 4:38 PM, Scott Kidder <kidder.scott@gmail.com> wrote:
Hi Jonathan,

You might be better off creating a small Hadoop HDFS cluster just for the purpose of storing
Flink checkpoint & savepoint data. Like you, I tried using S3 to persist Flink state,
but encountered AWS SDK issues and felt like I was going down an ill-advised path. I then
created a small 3-node HDFS cluster in the same region as my Flink hosts but distributed across
3 AZs. The checkpointing is very fast and, most importantly, just works.

Is there a firm requirement to use S3, or could you use HDFS instead?


--Scott Kidder

On Tue, Nov 22, 2016 at 11:52 PM, Jonathan Share <jon.share@gmail.com> wrote:

I'm interested in hearing if anyone else has experience with using Amazon S3 as a state backend
in the Frankfurt region. For political reasons we've been asked to keep all European data
in Amazon's Frankfurt region. This causes a problem, as the S3 endpoint in Frankfurt requires
the use of AWS Signature Version 4 ("This new Region supports only Signature Version 4" [1]),
and this doesn't appear to work with the Hadoop version that Flink is built against [2].

After some hacking we have managed to create a Docker image with a build of Flink 1.2 master,
copying over jar files from the Hadoop 3.0.0-alpha1 package. This appears to work for
the most part, but we still suffer from some classpath problems (conflicts between the AWS API
used in Hadoop and the one we want to use in our streams for interacting with Kinesis) and the
whole thing feels a little fragile. Has anyone else tried this? Is there a simpler solution?

As a follow-up question, we saw that with checkpointing on three relatively simple streams
set to 1 second, our S3 costs were higher than the EC2 costs for our entire infrastructure.
This seems slightly disproportionate. For now we have increased the checkpointing interval to 10
seconds and that has greatly improved the cost projections graphed via Amazon CloudWatch,
but I'm interested in hearing other people's experience with this. Is that the kind of billing
level we can expect or is this a symptom of a misconfiguration? Is this a setup others are
using? As we are using Kinesis as the source for all streams I don't see a huge risk with
larger checkpoint intervals and our Sinks are designed to mostly tolerate duplicates (some
improvements can be made).
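
To put the interval change in perspective, here is a rough sketch of how many checkpoints get triggered per month at 1 second versus 10 seconds. It only counts checkpoints; how many S3 requests each checkpoint produces and what they cost depends on the job and on current pricing, so neither is modelled here. The function name and the 30-day month are assumptions made for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def checkpoints_per_month(interval_seconds: int, streams: int = 3, days: int = 30) -> int:
    """Count checkpoints triggered across all streams over `days` days.

    Assumes every stream checkpoints at the same fixed interval and that
    no checkpoint is skipped or delayed (a simplification).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    per_stream = (days * SECONDS_PER_DAY) // interval_seconds
    return per_stream * streams


if __name__ == "__main__":
    for interval in (1, 10):
        count = checkpoints_per_month(interval)
        print(f"{interval:>2} s interval: {count:,} checkpoints per 30 days")
```

With three streams, a 1 second interval comes to 7,776,000 checkpoints in 30 days and a 10 second interval to 777,600, so any cost that scales with the number of checkpoints should drop by roughly a factor of ten.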

Thanks in advance

[1] https://aws.amazon.com/blogs/aws/aws-region-germany/
[2] https://issues.apache.org/jira/browse/HADOOP-13324
