flink-user mailing list archives

From Jonathan Share <jon.sh...@gmail.com>
Subject Re: S3 checkpointing in AWS in Frankfurt
Date Wed, 23 Nov 2016 17:40:53 GMT
Hi Greg,

Standard storage class, everything is on defaults, we've not done anything
special with the bucket.

CloudWatch only appears to give me total billing for S3 in general. I
don't see a breakdown, unless that's something I can configure somewhere.


On 23 November 2016 at 16:29, Greg Hogan <code@greghogan.com> wrote:

> Hi Jonathan,
> Which S3 storage class are you using? Do you have a breakdown of the S3
> costs as storage / API calls / early deletes / data transfer?
> Greg
> On Wed, Nov 23, 2016 at 2:52 AM, Jonathan Share <jon.share@gmail.com>
> wrote:
>> Hi,
>> I'm interested in hearing if anyone else has experience with using Amazon
>> S3 as a state backend in the Frankfurt region. For political reasons we've
>> been asked to keep all European data in Amazon's Frankfurt region. This
>> causes a problem, as the S3 endpoint in Frankfurt requires the use of AWS
>> Signature Version 4 ("This new Region supports only Signature Version 4"
>> [1]), and this doesn't appear to work with the Hadoop version that Flink is
>> built against [2].
>> After some hacking we have managed to create a Docker image with a build
>> of Flink 1.2 master, copying over jar files from the Hadoop
>> 3.0.0-alpha1 package. This appears to work for the most part, but we
>> still suffer from some classpath problems (conflicts between the AWS API
>> used in Hadoop and the one we want to use in our streams for interacting
>> with Kinesis) and the whole thing feels a little fragile. Has anyone else
>> tried this? Is there a simpler solution?
>> As a follow-up question, we saw that with checkpointing on three
>> relatively simple streams set to 1 second, our S3 costs were higher than
>> the EC2 costs for our entire infrastructure. This seems slightly
>> disproportionate. For now we have reduced the checkpointing interval to 10
>> seconds and that has greatly improved the cost projections graphed via
>> Amazon CloudWatch, but I'm interested in hearing other people's experience
>> with this. Is that the kind of billing level we can expect or is this a
>> symptom of a mis-configuration? Is this a setup others are using? As we are
>> using Kinesis as the source for all streams I don't see a huge risk with
>> larger checkpoint intervals and our Sinks are designed to mostly tolerate
>> duplicates (some improvements can be made).
>> Thanks in advance
>> Jonathan
>> [1] https://aws.amazon.com/blogs/aws/aws-region-germany/
>> [2] https://issues.apache.org/jira/browse/HADOOP-13324
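
A rough sketch of why the checkpoint interval discussed above has such a
direct effect on S3 request volume. The function names, the number of S3
requests per checkpoint, and the price per 1000 requests are all
hypothetical placeholders for illustration. They are not taken from the
thread or from AWS's price list, so substitute figures from your own bill.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def checkpoints_per_month(num_jobs, interval_seconds, days=30):
    """Checkpoints triggered across all jobs over `days` days."""
    return num_jobs * (days * SECONDS_PER_DAY) // interval_seconds


def monthly_request_cost(num_jobs, interval_seconds,
                         requests_per_checkpoint, price_per_1000_requests,
                         days=30):
    """Estimated request cost, ignoring storage and data transfer."""
    requests = (checkpoints_per_month(num_jobs, interval_seconds, days)
                * requests_per_checkpoint)
    return requests / 1000 * price_per_1000_requests


if __name__ == "__main__":
    # Three streams, as in the original message.
    # ASSUMPTIONS: 5 requests per checkpoint and 0.005 per 1000 requests
    # are placeholder values, not measured or quoted figures.
    for interval in (1, 10):
        count = checkpoints_per_month(3, interval)
        cost = monthly_request_cost(3, interval, 5, 0.005)
        print(f"interval={interval}s checkpoints={count} cost={cost:.2f}")
```

Under these assumptions, three streams checkpointing every second trigger
7,776,000 checkpoints in a 30-day month, against 777,600 at a 10-second
interval. Request-driven costs scale down by the same factor of ten
whatever the real per-request price is.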
