beam-commits mailing list archives

From "Dmitry Demeshchuk (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK
Date Thu, 13 Jul 2017 00:39:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084978#comment-16084978 ]

Dmitry Demeshchuk commented on BEAM-2572:
-----------------------------------------

I agree that resource access is indeed a filesystem problem.

Let's discuss each approach, then:

> Give file systems access to the pipeline options.

1. Do we somehow allow using separate credentials for reading and writing? Do we allow using
separate credentials for different paths (say, different S3 buckets)?
2. Let's say, for the sake of simplicity, that we are using a single set of credentials in the
pipeline options for the S3 filesystem. How do we provide AWS credentials to other, non-filesystem
sources and sinks (DynamoDB, Redshift, Kinesis, SQS, etc.)? Do we use pipeline options there too,
for the sake of consistency? Or do we provide AWS credentials as parameters to each of those
PTransforms? (A rough sketch of the pipeline-options approach follows below.)
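
To make the pipeline-options approach concrete, here is a minimal sketch of what it could look
like on the Python side. The S3Options class and the --s3_access_key_id / --s3_secret_access_key
flags are purely illustrative, not existing Beam options; the sketch assumes filesystems would
somehow receive the pipeline options:

    from apache_beam.options.pipeline_options import PipelineOptions
    import boto3

    class S3Options(PipelineOptions):
        """Hypothetical options view carrying S3 credentials (names are illustrative)."""

        @classmethod
        def _add_argparse_args(cls, parser):
            parser.add_argument('--s3_access_key_id', default=None,
                                help='AWS access key id used by the S3 filesystem.')
            parser.add_argument('--s3_secret_access_key', default=None,
                                help='AWS secret access key used by the S3 filesystem.')

    def make_s3_client(pipeline_options):
        # If FileSystem implementations were handed the pipeline options, an S3
        # filesystem could build its boto3 client from them like this.
        opts = pipeline_options.view_as(S3Options)
        return boto3.client(
            's3',
            aws_access_key_id=opts.s3_access_key_id,
            aws_secret_access_key=opts.s3_secret_access_key)

This would at least answer question 2 for filesystems, though it still leaves open whether
non-filesystem IOs should read the same options or take credentials as PTransform parameters.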

> File systems should be capable of acquiring credentials only by using the environment
> and pipeline options. (This could range from explicitly getting credentials in the options
> to using a KMS.)

3. How do we provision the environment on the runner nodes? It sounds like this will have to be
a runner-specific feature. Besides, some cloud providers (say, GCP) would rather require us to
provide credentials in a file, not in the environment.
4. If we use a KMS or something similar, should we introduce a notion of secret storage to Beam?
While I think that's an option, it doesn't seem drastically different to me from just using
pipeline options with a few extra security measures. Besides, it will likely increase the
infrastructural complexity of the setup. (A sketch of the environment-only fallback follows below.)
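
To make point 3 concrete as well, here is a minimal sketch of the environment-only fallback.
Nothing in it is Beam-specific; it just leans on boto3's default credential chain and assumes the
worker environment (or its instance profile) already carries credentials:

    import boto3

    def make_default_s3_client():
        # With no explicit credentials, botocore resolves them in order from the
        # AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables, the
        # shared ~/.aws/credentials file, and finally the EC2/ECS instance
        # metadata service (instance profile).
        return boto3.client('s3')

A filesystem could try explicit credentials from the pipeline options first and fall back to this
chain when none were given.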

(Please tell me anytime if we should move this discussion to dev@)

> Implement an S3 filesystem for Python SDK
> -----------------------------------------
>
>                 Key: BEAM-2572
>                 URL: https://issues.apache.org/jira/browse/BEAM-2572
>             Project: Beam
>          Issue Type: Task
>          Components: sdk-py
>            Reporter: Dmitry Demeshchuk
>            Assignee: Ahmet Altay
>            Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, so their behaviors
> may diverge in some edge cases (say, we write something to S3, but it's not immediately
> accessible for reading from the other end).
> 2. There are other AWS-based sources and sinks we may want to create in the future: DynamoDB,
> Kinesis, SQS, etc.
> 3. boto3 already provides reasonably good logic for basic things like retrying (sketch below).
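> (As a minimal illustration of point 3, assuming botocore's Config retry setting:)
>
>     import boto3
>     from botocore.config import Config
>
>     # Ask botocore to retry transient S3 errors up to 5 times.
>     s3 = boto3.client('s3', config=Config(retries={'max_attempts': 5}))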
> Whatever path we choose, there's another problem related to this: we currently cannot
> pass any global settings (say, pipeline options, or just an arbitrary kwarg) to a filesystem.
> Because of that, we'd have to set up the runner nodes with AWS keys in the environment,
> which is not trivial to achieve and doesn't look too clean either (I'd rather see one single
> place for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem implementation that
> only supports DirectRunner at the moment (because of the previous paragraph). I'm perfectly
> happy to finish it myself, with some guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!



