flink-user mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: S3 checkpointing in AWS in Frankfurt
Date Thu, 24 Nov 2016 10:28:31 GMT
We have been looking for a while for some way to decouple the S3 filesystem
support from Hadoop.

Does anyone know of a good S3 connector library that works independently of
Hadoop and EMRFS?


On Wed, Nov 23, 2016 at 7:57 PM, Greg Hogan <code@greghogan.com> wrote:

> EMRFS looks to *add* cost (and consistency).
> Storing an object to S3 costs "$0.005 per 1,000 requests", so $0.432/day
> at 1 Hz. Is the number of checkpoint files simply parallelism * number of
> operators? That could add up quickly.
> Is the recommendation to run HDFS on EBS?
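[Editor's note: the back-of-envelope arithmetic above can be sketched as follows. The $0.005 per 1,000 PUT requests figure is the one quoted in the message; the assumption of one uploaded file per parallel operator instance per checkpoint is a simplification for illustration, not a statement about Flink's actual checkpoint file layout.]

```python
# Back-of-envelope S3 checkpoint cost, based on the figures quoted above.
# Assumed: S3 PUT pricing of $0.005 per 1,000 requests, and one PUT per
# parallel operator instance per checkpoint (a simplification).

PUT_PRICE_PER_REQUEST = 0.005 / 1000  # USD per request

def daily_checkpoint_cost(interval_s, parallelism=1, operators=1):
    """Projected S3 PUT cost per day for a given checkpoint interval."""
    checkpoints_per_day = 86_400 / interval_s
    requests_per_day = checkpoints_per_day * parallelism * operators
    return requests_per_day * PUT_PRICE_PER_REQUEST

# One file per checkpoint at 1 Hz reproduces the $0.432/day figure:
print(round(daily_checkpoint_cost(1), 3))        # 0.432
# With parallelism 8 and 5 stateful operators it adds up quickly:
print(round(daily_checkpoint_cost(1, 8, 5), 2))  # 17.28
```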
> On Wed, Nov 23, 2016 at 12:40 PM, Jonathan Share <jon.share@gmail.com>
> wrote:
>> Hi Greg,
>> Standard storage class, everything is on defaults, we've not done
>> anything special with the bucket.
>> CloudWatch only appears to give me total billing for S3 in general; I
>> don't see a breakdown unless that's something I can configure somewhere.
>> Regards,
>> Jonathan
>> On 23 November 2016 at 16:29, Greg Hogan <code@greghogan.com> wrote:
>>> Hi Jonathan,
>>> Which S3 storage class are you using? Do you have a breakdown of the S3
>>> costs as storage / API calls / early deletes / data transfer?
>>> Greg
>>> On Wed, Nov 23, 2016 at 2:52 AM, Jonathan Share <jon.share@gmail.com>
>>> wrote:
>>>> Hi,
>>>> I'm interested in hearing if anyone else has experience with using
>>>> Amazon S3 as a state backend in the Frankfurt region. For political reasons
>>>> we've been asked to keep all European data in Amazon's Frankfurt region.
>>>> This causes a problem, as the S3 endpoint in Frankfurt supports only AWS
>>>> Signature Version 4 ("This new Region supports only Signature
>>>> Version 4" [1]), and this doesn't appear to work with the Hadoop version
>>>> that Flink is built against [2].
>>>> After some hacking we have managed to create a Docker image with a
>>>> build of Flink 1.2 master, copying over jar files from the Hadoop
>>>> 3.0.0-alpha1 package, and this appears to work for the most part, but we
>>>> still suffer from some classpath problems (conflicts between the AWS APIs
>>>> used in Hadoop and those we want to use in our streams for interacting with
>>>> Kinesis), and the whole thing feels a little fragile. Has anyone else tried
>>>> this? Is there a simpler solution?
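[Editor's note: for reference, selecting the Frankfurt endpoint is typically done in `core-site.xml`. The fragment below is a sketch only; it assumes a Hadoop version whose s3a client honours `fs.s3a.endpoint`, which the thread suggests may not hold for the version Flink bundled at the time.]

```xml
<!-- core-site.xml: point s3a at the Frankfurt (eu-central-1) endpoint,
     which accepts only Signature Version 4 requests. Assumes a Hadoop
     build whose s3a client supports fs.s3a.endpoint. -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.eu-central-1.amazonaws.com</value>
</property>
```

Older AWS Java SDK versions may additionally need the JVM system property `com.amazonaws.services.s3.enableV4=true` to force Signature Version 4 signing.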
>>>> As a follow-up question, we saw that with the checkpointing interval on
>>>> three relatively simple streams set to 1 second, our S3 costs were higher
>>>> than the EC2 costs for our entire infrastructure. This seems slightly
>>>> disproportionate. For now we have reduced the checkpointing interval to 10
>>>> seconds, and that has greatly improved the cost projections graphed via
>>>> Amazon CloudWatch, but I'm interested in hearing other people's experience
>>>> with this. Is that the kind of billing level we can expect, or is this a
>>>> symptom of a misconfiguration? Is this a setup others are using? As we are
>>>> using Kinesis as the source for all streams I don't see a huge risk with
>>>> larger checkpoint intervals, and our sinks are designed to mostly tolerate
>>>> duplicates (some improvements can be made).
>>>> Thanks in advance
>>>> Jonathan
>>>> [1] https://aws.amazon.com/blogs/aws/aws-region-germany/
>>>> [2] https://issues.apache.org/jira/browse/HADOOP-13324
