beam-commits mailing list archives

From "Dmitry Demeshchuk (JIRA)" <>
Subject [jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK
Date Thu, 13 Jul 2017 22:37:00 GMT


Dmitry Demeshchuk commented on BEAM-2572:

re 1: I just don't want us to end up in a situation like this:

List: We just released an S3 filesystem! Please use it and tell us what you think!
User7231: Hi, how do I provide credentials for the filesystem when I run my stuff on Dataflow?
List: Just set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY on your
Dataflow nodes!
User7231: Cool, how can I do that?
List: Well, there's no official way, so you just hack together a custom package, or something
like that!
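
For reference, the "environment variables" answer works because boto3's default credential
chain already reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY on its own; nothing
Beam-specific is involved. A minimal check, assuming boto3 is installed on the worker:

    import boto3

    # If AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set in this process's
    # environment, boto3 picks them up automatically; no extra wiring is needed.
    s3 = boto3.client('s3')
    print(s3.list_buckets()['Buckets'])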

We only have two runners for Python right now: Direct and Dataflow. I think it would make
sense to make this runnable on Dataflow too, even if configuring the environment ends up
being a Dataflow-specific mechanism, entirely independent of Beam. What worries me about
making it a Dataflow feature is that the whole Beam S3 feature would then depend on the
Dataflow planning and release cycle before it becomes usable to people at all.

re 2, 3: That's a good point. FWIW, I'm all for reducing the scope and complexity of this
feature. I'd rather have a non-ideal solution in a month than an ideal solution someday.

I apologize for dragging this conversation out so far; there just doesn't seem to be a clear
consensus on the subject, and I really want this to be usable beyond just the direct runner.
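
For concreteness, the kind of plumbing I'd hope for eventually is sketched below. The
S3Options class and its flag names are purely hypothetical placeholders, not a proposal
for the actual API; the only Beam-specific piece is the standard custom PipelineOptions
pattern:

    import boto3
    from apache_beam.options.pipeline_options import PipelineOptions

    class S3Options(PipelineOptions):
      """Hypothetical options class; the flag names are placeholders only."""

      @classmethod
      def _add_argparse_args(cls, parser):
        parser.add_argument('--s3_access_key_id', default=None)
        parser.add_argument('--s3_secret_access_key', default=None)

    def make_s3_client(pipeline_options):
      # If a filesystem could see the pipeline options, it could build its
      # client from them; passing None for both keys simply falls back to
      # boto3's normal credential chain (environment, ~/.aws/credentials, ...).
      s3_options = pipeline_options.view_as(S3Options)
      return boto3.client(
          's3',
          aws_access_key_id=s3_options.s3_access_key_id,
          aws_secret_access_key=s3_options.s3_secret_access_key)

With something like that, credentials would live in one place (the pipeline options)
regardless of which runner executes the pipeline.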

> Implement an S3 filesystem for Python SDK
> -----------------------------------------
>                 Key: BEAM-2572
>                 URL:
>             Project: Beam
>          Issue Type: Task
>          Components: sdk-py
>            Reporter: Dmitry Demeshchuk
>            Assignee: Ahmet Altay
>            Priority: Minor
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, so their behaviors
> may diverge in some edge cases (say, we write something to S3, but it's not immediately
> readable from another client).
> 2. There are other AWS-based sources and sinks we may want to create in the future:
> DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides reasonably good logic for basic things like retrying.
> Whatever path we choose, there's another problem related to this: we currently cannot
> pass any global settings (say, pipeline options, or just an arbitrary kwarg) to a filesystem.
> Because of that, we'd have to set up the runner nodes with AWS keys in the environment,
> which is not trivial to achieve and doesn't look too clean either (I'd rather see a single
> place for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem implementation
> that only supports DirectRunner at the moment (because of the previous paragraph). I'm
> perfectly fine finishing it myself, with some guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!
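
A quick illustration of the retry point from the description above: retry behaviour comes
straight from botocore's client Config, so the filesystem itself wouldn't have to reimplement
it. The bucket and key below are made-up placeholders:

    import boto3
    from botocore.config import Config

    # botocore applies this retry policy to every call made through the client.
    s3 = boto3.client('s3', config=Config(retries={'max_attempts': 10}))

    # The kind of low-level call the filesystem's exists()/size() would wrap.
    response = s3.head_object(Bucket='my-example-bucket', Key='path/to/object')
    print(response['ContentLength'])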
