flink-user mailing list archives

From Michael-Keith Bernard <mkbern...@opentable.com>
Subject Re: Flink + S3
Date Tue, 19 Apr 2016 23:35:25 GMT
Hey Till & Ufuk,

We're running on self-managed EC2 instances (and we'll eventually have a mirror cluster in
our colo). The provided documentation notes that for Hadoop 2.6, we'd need such-and-such version
of hadoop-aws and guice on the CP. If I wanted to instead use Hadoop 2.7, which versions of
those dependencies should I get? And how can I look that up myself? The pom file for hadoop-aws[1]
doesn't mention a specific dependency on Guice, so I'm curious how the author of that documentation
knew exactly the dependencies and versions required.

Let me switch my questioning slightly:

What is the best (most widely supported, most common, easiest to use, easiest to scale, etc.)
way to deploy Flink today? I've been operating under the assumption that, since we have no
existing Hadoop infrastructure, the path of least resistance is a standalone cluster. However,
it seems like Flink is still relatively tightly coupled to the Hadoop platform, so maybe I
would be better off switching to Hadoop + YARN? Our requirements are simple (for now):

Kafka (consumer & producer), S3 (read & write), streaming- and batch-mode computation

If the answer turns out to be that YARN is the best path forward for us, do you have any recommendations
on how to get started building a minimal but production-ready Hadoop cluster suitable for
Flink? Ambari looks amazing, so barring feedback to the contrary I'll probably be investing
time looking at that first.

Finally, any relevant book recommendations? :) I'm extremely excited about this project, so
all the feedback I can get is highly welcome and highly appreciated!


P.S. Is there planned support for Mesos as an alternative scheduler to YARN?

[1]: http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.2/hadoop-aws-2.7.2.pom

From: Ufuk Celebi <uce@apache.org>
Sent: Tuesday, April 19, 2016 2:30 AM
To: user@flink.apache.org
Subject: Re: Flink + S3

Hey Michael-Keith,

are you running self-managed EC2 instances or EMR?

In addition to what Till said:

We tried to document this here as well:

Does this help? You don't need to really install Hadoop, but only
provide the configuration and the S3 FileSystem code on your

If you use EMR + Flink on YARN, it should work out of the box.

– Ufuk

On Tue, Apr 19, 2016 at 10:23 AM, Till Rohrmann <trohrmann@apache.org> wrote:
> Hi Michael-Keith,
> you can use S3 as the checkpoint directory for the filesystem state backend.
> This means that whenever a checkpoint is performed the state data will be
> written to this directory.
> The same holds true for the zookeeper recovery storage directory. This
> directory will contain the submitted and not yet finished jobs as well as
> some meta data for the checkpoints. With this information it is possible to
> restore running jobs if the job manager dies.
> As far as I know, Flink relies on Hadoop's file system wrapper classes to
> support S3. Flink has built-in support for hdfs, maprfs and the local file
> system. For everything else, Flink tries to find a Hadoop class. Therefore,
> I fear that you need at least Hadoop's s3 filesystem class in your classpath
> and a file called core-site.xml or hdfs-site.xml which is stored at a
> location specified by fs.hdfs.hdfsdefault in Flink's configuration. And in
> one of these files you have to create the xml tag to specify the class. But
> the easiest way would be to simply install Hadoop.
> I'm not aware of any Puppet scripts, but I might be missing something here. If you
> should complete a Puppet script, then it would definitely be a valuable
> addition to Flink :-)
> Cheers,
> Till
> On Tue, Apr 19, 2016 at 3:54 AM, Michael-Keith Bernard
> <mkbernard@opentable.com> wrote:
>> Hello Flink Users!
>> I'm a Flink newbie at the early stages of deploying our first Flink
>> cluster into production and I have a few questions about wiring up Flink
>> with S3:
>> * We are going to use the HA configuration[1] from day one (we have
>> existing zk infrastructure already). Can S3 be used as a state backend for
>> the Job Manager? The documentation talks about using S3 as a state backend
>> for TM[2] (and in particular for streaming), but I'm wondering if it's a
>> suitable backend for the JM as well.
>> * How do I configure S3 for Flink when I don't already have an existing
>> Hadoop cluster? The documentation references the Hadoop configuration
>> manifest[3], which kind of implies to me that I must already be running
>> Hadoop (or at least have a properly configured Hadoop cluster). Is there an
>> example somewhere of using S3 as a storage backend for a standalone cluster?
>> * Bonus: I'm writing a Puppet module for installing/configuring/managing
>> Flink in standalone mode with an existing zk cluster. Are there any
>> existing modules for this (I didn't find anything in the forge)? Would
>> others in the community be interested if we added our module to the forge
>> once complete?
>> Thanks so much for your time and consideration. We look forward to using
>> Flink in production!
>> Cheers,
>> Michael-Keith
>> [1]:
>> https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html#standalone-cluster-high-availability
>> [2]:
>> https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#s3-simple-storage-service
>> [3]:
>> https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#set-s3-filesystem
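
Editor's sketch (not part of the original messages): Till describes placing a core-site.xml on the machine and declaring the S3 filesystem class in it, without a full Hadoop install. The Python snippet below is one hedged way to generate such a file. The property name `fs.s3.impl` and the class `org.apache.hadoop.fs.s3a.S3AFileSystem` are assumptions based on Hadoop's hadoop-aws module and should be checked against the Hadoop version in use; `build_core_site` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: property name and class are taken from Hadoop's hadoop-aws
# module; verify both against the Hadoop version on your classpath.
S3_PROPERTIES = {
    "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
}


def build_core_site(properties):
    """Return a Hadoop-style <configuration> document as a string.

    Each entry in `properties` becomes one <property> element holding a
    <name> and a <value> child.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


core_site_xml = build_core_site(S3_PROPERTIES)
print(core_site_xml)
```

The resulting text would be saved as core-site.xml at the location that Flink's configuration points to, as described in Till's reply. Credentials and the hadoop-aws jar itself are deliberately left out of this sketch.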