beam-commits mailing list archives

From "Jacob Marble (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (BEAM-2500) Add support for S3 as a Apache Beam FileSystem
Date Thu, 14 Sep 2017 04:58:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165391#comment-16165391 ]

Jacob Marble edited comment on BEAM-2500 at 9/14/17 4:57 AM:
-------------------------------------------------------------

I'm interested in implementing S3 support. I'm not familiar with Beam internals and am not
committing myself to anything, but perhaps someone can comment on my research notes.

GCS is probably a good template. Implement FileSystem, ResourceId, FileSystemRegistrar, PipelineOptions,
PipelineOptionsRegistrar:
https://github.com/apache/beam/tree/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp

For interacting with S3, this is probably the preferred SDK:
http://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3

Some specifics about implementing FileSystem:

FileSystem.copy()
- AmazonS3Client.copyObject(String sourceBucketName, String sourceKey, String destinationBucketName,
String destinationKey)
- max size for a single copy is 5GB, which is probably fine to start, but need to use multipart
upload (copying in parts) to get full 5TB limit
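
A rough sketch of how the byte ranges for copying a large object in parts could be computed, using only the JDK. The class and method names are hypothetical, the actual S3 calls are left out, and treating 5GB as 5 * 2^30 bytes is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Beam or the AWS SDK.
final class CopyPartRangesSketch {
    // Assumed single-copy limit from the notes above (5GB read as 5 * 2^30 bytes).
    static final long SINGLE_COPY_LIMIT = 5L * 1024 * 1024 * 1024;

    /** Inclusive byte range [first, last] covered by one part. */
    static final class Range {
        final long first;
        final long last;

        Range(long first, long last) {
            this.first = first;
            this.last = last;
        }
    }

    /** True when the object is too large for a single copy call. */
    static boolean needsMultipartCopy(long objectSize) {
        return objectSize > SINGLE_COPY_LIMIT;
    }

    /** Splits [0, objectSize) into consecutive ranges of at most partSize bytes. */
    static List<Range> split(long objectSize, long partSize) {
        if (objectSize < 0 || partSize <= 0) {
            throw new IllegalArgumentException("objectSize must be >= 0 and partSize > 0");
        }
        List<Range> ranges = new ArrayList<>();
        for (long first = 0; first < objectSize; first += partSize) {
            long last = Math.min(first + partSize, objectSize) - 1;
            ranges.add(new Range(first, last));
        }
        return ranges;
    }
}
```

For example, split(10, 4) yields the ranges 0-3, 4-7 and 8-9; each range would become one part of the copy.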

FileSystem.create()
- AmazonS3Client.putObject(String bucketName, String key, InputStream input, ObjectMetadata
metadata)
- max upload size is 5GB, which is probably fine to start, but need to use multipart upload
to get full 5TB limit
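
Since create() hands back a channel and the total size is not known up front, a multipart upload would probably need to buffer writes and send a part whenever the buffer fills. A minimal sketch of that buffering, with the upload itself replaced by a hypothetical callback (PartBufferSketch and the sink are illustrative names, not SDK types):

```java
import java.util.Arrays;
import java.util.function.BiConsumer;

// Hypothetical helper; the sink stands in for whatever call uploads one part.
final class PartBufferSketch {
    private final byte[] buffer;
    private final BiConsumer<Integer, byte[]> sink; // receives (partNumber, partBytes)
    private int filled = 0;
    private int nextPartNumber = 1; // S3 part numbers start at 1

    PartBufferSketch(int partSize, BiConsumer<Integer, byte[]> sink) {
        if (partSize <= 0) {
            throw new IllegalArgumentException("partSize must be positive");
        }
        this.buffer = new byte[partSize];
        this.sink = sink;
    }

    /** Appends data, emitting a part each time the buffer becomes full. */
    void write(byte[] data) {
        int offset = 0;
        while (offset < data.length) {
            int n = Math.min(buffer.length - filled, data.length - offset);
            System.arraycopy(data, offset, buffer, filled, n);
            filled += n;
            offset += n;
            if (filled == buffer.length) {
                emitPart();
            }
        }
    }

    /** Emits whatever is left as the final, possibly shorter, part. */
    void finish() {
        if (filled > 0) {
            emitPart();
        }
    }

    private void emitPart() {
        sink.accept(nextPartNumber++, Arrays.copyOf(buffer, filled));
        filled = 0;
    }
}
```

S3 documents a minimum part size of 5MB for every part except the last and a maximum of 10,000 parts, so a real part size would have to be chosen with both limits in mind. Writing zero bytes emits no part at all here, which a real implementation would need to handle separately.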

FileSystem.delete()
- AmazonS3Client.deleteObjects(DeleteObjectsRequest deleteObjectsRequest)
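
One detail worth planning for: a single multi-object delete request accepts at most 1000 keys, so a longer list has to be split into batches. A minimal JDK-only sketch of the batching (the class name is hypothetical and the S3 call itself is omitted):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Beam or the AWS SDK.
final class DeleteBatchesSketch {
    // Documented S3 limit for one multi-object delete request.
    static final int MAX_KEYS_PER_REQUEST = 1000;

    /** Splits keys into consecutive batches of at most batchSize entries. */
    static List<List<String>> batches(List<String> keys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < keys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, keys.size());
            // Copy the sublist so each batch is independent of the input list.
            result.add(new ArrayList<>(keys.subList(start, end)));
        }
        return result;
    }
}
```

A delete request also targets a single bucket, so keys would first need to be grouped by bucket before batching.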

FileSystem.getScheme()
- return "s3"

FileSystem.match()
- j.o.apache.beam.sdk.extensions.util.gcsfs.GcsPath and same.GcsUtil have some good ideas
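
A common way to match globs on an object store is to list keys under the longest wildcard-free prefix and then filter them with a regex. The sketch below is not taken from GcsUtil; the class name and the simplified glob rules (only '*' and '?', neither crossing '/') are assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical helper with simplified glob semantics; not Beam's implementation.
final class GlobMatchSketch {
    /** Longest leading part of the glob without '*' or '?', usable as a listing prefix. */
    static String nonWildcardPrefix(String glob) {
        int i = 0;
        while (i < glob.length() && "*?".indexOf(glob.charAt(i)) < 0) {
            i++;
        }
        return glob.substring(0, i);
    }

    /** Translates the glob into a regex: '*' is any run of non-'/' characters, '?' is one. */
    static Pattern toRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        for (char c : glob.toCharArray()) {
            switch (c) {
                case '*':
                    sb.append("[^/]*");
                    break;
                case '?':
                    sb.append("[^/]");
                    break;
                default:
                    // Quote every literal character so regex metacharacters stay literal.
                    sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }
}
```

With this, a glob such as logs/2017/*.txt would list under the prefix logs/2017/ and keep only the keys for which toRegex(glob).matcher(key).matches() is true.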

FileSystem.matchNewResource()
- Look at GcsPath and GcsUtil
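
matchNewResource() needs to turn a path string into a resource, so some equivalent of GcsPath for s3:// paths will be needed. A minimal sketch of splitting such a path into bucket and key with plain string handling (S3PathSketch is a hypothetical name; java.net.URI is avoided on purpose because keys may contain characters such as '?' or '#'):

```java
// Hypothetical value class; not part of Beam or the AWS SDK.
final class S3PathSketch {
    private static final String PREFIX = "s3://";

    final String bucket;
    final String key;

    private S3PathSketch(String bucket, String key) {
        this.bucket = bucket;
        this.key = key;
    }

    /** Parses "s3://bucket/some/key" into its bucket and key parts. */
    static S3PathSketch parse(String spec) {
        if (!spec.startsWith(PREFIX)) {
            throw new IllegalArgumentException("expected an s3:// path: " + spec);
        }
        String rest = spec.substring(PREFIX.length());
        int slash = rest.indexOf('/');
        String bucket = slash < 0 ? rest : rest.substring(0, slash);
        String key = slash < 0 ? "" : rest.substring(slash + 1);
        if (bucket.isEmpty()) {
            throw new IllegalArgumentException("missing bucket in: " + spec);
        }
        return new S3PathSketch(bucket, key);
    }

    /** Treats the bucket root and keys ending in '/' as directories (an assumption). */
    boolean isDirectory() {
        return key.isEmpty() || key.endsWith("/");
    }

    @Override
    public String toString() {
        return PREFIX + bucket + "/" + key;
    }
}
```

S3 has no real directories, so the isDirectory() convention here is only one possible choice.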

FileSystem.open()
- AmazonS3Client.getObject(String bucketName, String key)
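
Beam's open() returns a ReadableByteChannel, while getObject() exposes the object content as an InputStream, so the stream has to be adapted. A minimal sketch of that adaptation with the JDK's Channels helper, using an in-memory stream as a stand-in for the S3 content (class and method names are hypothetical):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the InputStream stands in for the S3 object content.
final class OpenChannelSketch {
    /** Wraps an object content stream in a channel. */
    static ReadableByteChannel asChannel(InputStream objectContent) {
        return Channels.newChannel(objectContent);
    }

    /** Reads up to capacity bytes from the channel and decodes them as UTF-8. */
    static String readUtf8(ReadableByteChannel channel, int capacity) {
        ByteBuffer buf = ByteBuffer.allocate(capacity);
        try {
            // Keep reading until the buffer is full or the channel reports end of stream.
            while (buf.hasRemaining() && channel.read(buf) >= 0) {
                // nothing to do; read() advances the buffer position
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buf.flip();
        return StandardCharsets.UTF_8.decode(buf).toString();
    }
}
```

A channel built this way is not seekable; supporting seeks would probably require ranged reads, which this sketch does not cover.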

FileSystem.rename()
- Can't find anything in AmazonS3Client; perhaps call FileSystem.copy(), then FileSystem.delete()
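
A minimal sketch of rename as copy followed by delete, using an in-memory map as a hypothetical stand-in for a bucket. None of these names come from Beam or the AWS SDK; the point is only the ordering and the two caveats noted in the comments.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for one bucket: key -> object bytes.
final class RenameSketch {
    private final Map<String, byte[]> objects = new TreeMap<>();

    void put(String key, byte[] data) {
        objects.put(key, data.clone());
    }

    boolean exists(String key) {
        return objects.containsKey(key);
    }

    void copy(String src, String dst) {
        byte[] data = objects.get(src);
        if (data == null) {
            throw new IllegalArgumentException("no such key: " + src);
        }
        objects.put(dst, data.clone());
    }

    void delete(String key) {
        objects.remove(key);
    }

    /**
     * Rename as copy-then-delete. This is not atomic: if the delete fails
     * after the copy succeeded, both keys exist.
     */
    void rename(String src, String dst) {
        if (src.equals(dst)) {
            return; // otherwise the delete would remove the only copy
        }
        copy(src, dst);
        delete(src);
    }
}
```

Deleting only after the copy has succeeded means a failure can leave a duplicate but should not lose data.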

I'm not clear about how to register the S3 FileSystem as mentioned in the FileSystemRegistrar
Javadoc:

"FileSystem creators have the ability to provide a registrar by creating a ServiceLoader entry
and a concrete implementation of this interface.

It is optional but recommended to use one of the many build time tools such as AutoService
to generate the necessary META-INF files automatically."
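
For context, a "ServiceLoader entry" in plain Java is a classpath resource named META-INF/services/ followed by the fully qualified name of the interface, containing the fully qualified name of each implementation class, one per line. java.util.ServiceLoader reads those files to find implementations, and AutoService generates them from an annotation at build time. The sketch below shows the lookup side with a hypothetical interface standing in for FileSystemRegistrar; none of these types are Beam's.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical stand-in for a registrar interface; not Beam's FileSystemRegistrar.
interface SchemeRegistrarSketch {
    String scheme();
}

// Concrete implementation. To be discovered, its fully qualified class name must be
// listed in a classpath resource named
//   META-INF/services/<fully qualified name of SchemeRegistrarSketch>
// In a real project this would be a public top-level class with a public
// no-argument constructor; it is package-private here only to fit in one file.
class S3SchemeRegistrarSketch implements SchemeRegistrarSketch {
    @Override
    public String scheme() {
        return "s3";
    }
}

final class RegistrarDiscoverySketch {
    /** Returns the schemes of all registrars that ServiceLoader finds on the classpath. */
    static List<String> discoveredSchemes() {
        List<String> schemes = new ArrayList<>();
        for (SchemeRegistrarSketch registrar : ServiceLoader.load(SchemeRegistrarSketch.class)) {
            schemes.add(registrar.scheme());
        }
        return schemes;
    }
}
```

Without the META-INF/services file on the classpath, discoveredSchemes() returns an empty list even though the implementation class exists. That is why the Javadoc recommends a build-time tool to generate the file.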


was (Author: jmarble):
I'm interested in implementing S3 support. I'm not familiar with Beam internals and am not
committing myself to anything, but perhaps someone can comment on my research notes.

GCS is probably a good template. Implement FileSystem, ResourceId, FileSystemRegistrar, PathValidator,
PipelineOptions, PipelineOptionsRegistrar:
https://github.com/apache/beam/tree/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp

For interacting with S3, this is probably the preferred SDK:
http://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3

Some specifics about implementing FileSystem:

FileSystem.copy()
- AmazonS3Client.copyObject(String sourceBucketName, String sourceKey, String destinationBucketName,
String destinationKey)
- max size for a single copy is 5GB, which is probably fine to start, but need to use multipart
upload (copying in parts) to get full 5TB limit

FileSystem.create()
- AmazonS3Client.putObject(String bucketName, String key, InputStream input, ObjectMetadata
metadata)
- max upload size is 5GB, which is probably fine to start, but need to use multipart upload
to get full 5TB limit

FileSystem.delete()
- AmazonS3Client.deleteObjects(DeleteObjectsRequest deleteObjectsRequest)

FileSystem.getScheme()
- return "s3"

FileSystem.match()
- j.o.apache.beam.sdk.extensions.util.gcsfs.GcsPath and same.GcsUtil have some good ideas

FileSystem.matchNewResource()
- Look at GcsPath and GcsUtil

FileSystem.open()
- AmazonS3Client.getObject(String bucketName, String key)

FileSystem.rename()
- Can't find anything in AmazonS3Client; perhaps call FileSystem.copy(), then FileSystem.delete()

I'm not clear about how to register the S3 FileSystem as mentioned in the FileSystemRegistrar
Javadoc:

"FileSystem creators have the ability to provide a registrar by creating a ServiceLoader entry
and a concrete implementation of this interface.

It is optional but recommended to use one of the many build time tools such as AutoService
to generate the necessary META-INF files automatically."

> Add support for S3 as a Apache Beam FileSystem
> ----------------------------------------------
>
>                 Key: BEAM-2500
>                 URL: https://issues.apache.org/jira/browse/BEAM-2500
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>            Reporter: Luke Cwik
>            Priority: Minor
>         Attachments: hadoop_fs_patch.patch
>
>
> Note that this is for providing direct integration with S3 as an Apache Beam FileSystem.
> There is already support for using the Hadoop S3 connector by depending on the Hadoop
> File System module[1] and configuring HadoopFileSystemOptions[2] with an S3 configuration[3].
> 1: https://github.com/apache/beam/tree/master/sdks/java/io/hadoop-file-system
> 2: https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L53
> 3: https://wiki.apache.org/hadoop/AmazonS3



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
