flink-user mailing list archives

From Ufuk Celebi <...@apache.org>
Subject Re: Flink + S3
Date Wed, 20 Apr 2016 10:20:28 GMT
On Wed, Apr 20, 2016 at 1:35 AM, Michael-Keith Bernard
<mkbernard@opentable.com> wrote:
> We're running on self-managed EC2 instances (and we'll eventually have
> a mirror cluster in our colo). The provided documentation notes that
> for Hadoop 2.6, we'd need such-and-such version of hadoop-aws and guice
> on the CP. If I wanted to instead use Hadoop 2.7, which versions of
> those dependencies should I get? And how can I look that up myself? The
> pom file for hadoop-aws[1] doesn't mention a specific dependency on
> Guice, so I'm curious how the author of that documentation knew exactly
> the dependencies and versions required.

Hey Michael-Keith,

I think you meant Guava and not Guice.

Determining which dependencies you need is quite a mess at the
moment. It depends on a combination of three things:
1) the dependencies of hadoop-aws [1],
2) which S3 file system you use (in case of the docs
org.apache.hadoop.fs.s3native.NativeS3FileSystem) [2],
3) what Flink shades away in its Hadoop dependencies [3]

1) hadoop-aws depends on hadoop-common (and other packages).
hadoop-common is already part of Flink (including the fs.FileSystem
classes etc.)
2) NativeS3FileSystem uses dependencies from hadoop-common like
FileSystem and from hadoop-aws like Jets3tNativeFileSystemStore. The
hadoop-common stuff is part of Flink and Jets3tNativeFileSystemStore
is part of hadoop-aws. The big issue here is that other S3 FS
implementations might work with the aws-java-sdk packages of
3) Flink shades away Hadoop's Guava dependency, which is why you need
to add Guava manually to the classpath (CP).
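
For point 1), a rough way to see what a POM declares directly is to
parse it yourself. The following is only a sketch, not how the docs
were written: the function name list_dependencies and the trimmed-down
SAMPLE_POM are made up for illustration and are not the real hadoop-aws
POM. It only shows direct dependencies, so anything pulled in
transitively or shaded away (like Guava in point 3) will not show up.

```python
import xml.etree.ElementTree as ET

# Maven POM files put all elements in this XML namespace.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def list_dependencies(pom_xml):
    """Return (groupId, artifactId, version, scope) for each direct dependency.

    version is None when the POM does not state one (it is then usually
    inherited from a parent POM). scope defaults to "compile", as in Maven.
    """
    root = ET.fromstring(pom_xml)
    deps = []
    for dep in root.findall("./m:dependencies/m:dependency", POM_NS):

        def text(tag, default=None):
            node = dep.find("m:" + tag, POM_NS)
            if node is None or not node.text:
                return default
            return node.text.strip()

        deps.append(
            (text("groupId"), text("artifactId"), text("version"), text("scope", "compile"))
        )
    return deps


# Hypothetical POM for illustration only; NOT the real hadoop-aws POM.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>com.example</groupId>
      <artifactId>example-s3-client</artifactId>
      <version>1.0.0</version>
    </dependency>
  </dependencies>
</project>
"""

if __name__ == "__main__":
    for group, artifact, version, scope in list_dependencies(SAMPLE_POM):
        print(group, artifact, version, scope)
```

To use it on a real artifact, you would save the POM from the
repository page and pass its contents to list_dependencies. A missing
version there does not mean "any version"; it has to be resolved
against the parent POM.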

So, if you go for the suggested NativeS3FileSystem, you end up needing
hadoop-aws and Guava. Of course, this might change in future versions
of Flink and/or Hadoop. I will update the docs for the different
versions of Flink and Hadoop for now and hope that this will help. :-(

The easiest solution in the future would be for Flink to ship with
hadoop-aws, but I don't think that is going to happen.

– Ufuk

[1] http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.6.0
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/aws.html#provide-s3-filesystem-dependency
[3] https://github.com/apache/flink/blob/master/flink-shaded-hadoop/pom.xml
