Date: Tue, 11 Jul 2017 20:17:00 +0000 (UTC)
From: "Sourabh Bajaj (JIRA)"
To: commits@beam.apache.org
Subject: [jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK

[ https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082887#comment-16082887 ]

Sourabh Bajaj commented on BEAM-2572:
-------------------------------------

I think this plan makes sense overall, but step 4 may be of unclear difficulty, since it depends on the runners correctly propagating the pipeline options to all the workers. It would be great to start a thread about this, because we don't yet have a clear story for passing credentials to PTransforms.

> Implement an S3 filesystem for Python SDK
> -----------------------------------------
>
>                 Key: BEAM-2572
>                 URL: https://issues.apache.org/jira/browse/BEAM-2572
>             Project: Beam
>          Issue Type: Task
>          Components: sdk-py
>            Reporter: Dmitry Demeshchuk
>            Assignee: Ahmet Altay
>            Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (as is done in Java).
> 2. Using boto/boto3 to access S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, so their behaviors may contradict each other in some edge cases (say, we write something to S3, but it's not immediately available for reading from another client).
> 2. There are other AWS-based sources and sinks we may want to create in the future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides reasonably good logic for basic things like retrying failed requests.
> Whatever path we choose, there's another problem related to this: we currently cannot pass any global settings (say, pipeline options, or just an arbitrary kwarg) to a filesystem. Because of that, we'd have to set up the runner nodes with AWS keys in the environment, which is not trivial to achieve and doesn't look too clean either (I'd rather see one single place for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem implementation that only supports DirectRunner at the moment (because of the previous paragraph). I'm perfectly fine finishing it myself, with some guidance from the maintainers.
> Where should I go from here, and whose input should I be looking for?
> Thanks!


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
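The credential-passing problem raised above (no way to hand pipeline options or an arbitrary kwarg to a filesystem) can be illustrated with a pure-Python sketch. Everything below is hypothetical: `FileSystems`, `S3FileSystem`, and the option key names are illustrative only, not Beam's actual API.

```python
import os

class S3FileSystem:
    """Hypothetical filesystem that takes its credentials from an
    options dict handed in at construction time, falling back to the
    worker environment only when the options omit them."""
    SCHEME = "s3"

    def __init__(self, options):
        self.access_key = options.get(
            "aws_access_key_id", os.environ.get("AWS_ACCESS_KEY_ID"))
        self.secret_key = options.get(
            "aws_secret_access_key", os.environ.get("AWS_SECRET_ACCESS_KEY"))

class FileSystems:
    """Hypothetical registry that dispatches on the path scheme and
    forwards pipeline-level options to the chosen filesystem."""
    _registry = {}

    @classmethod
    def register(cls, fs_class):
        cls._registry[fs_class.SCHEME] = fs_class

    @classmethod
    def for_path(cls, path, options):
        scheme = path.split("://", 1)[0]
        return cls._registry[scheme](options)

FileSystems.register(S3FileSystem)

# One single place for configuring the runner options, as the issue asks for:
pipeline_options = {"aws_access_key_id": "test-key"}
fs = FileSystems.for_path("s3://bucket/key", pipeline_options)
```

The point of the sketch is only the plumbing: the runner would carry one options object and pass it through to every filesystem it instantiates, instead of each worker needing AWS keys preinstalled in its environment.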
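On point 3, the "reattempting" that boto3 provides out of the box is essentially retry with exponential backoff. A minimal stdlib-only sketch of that idea (the function names here are made up, not boto3's API):

```python
import time

def with_retries(fn, max_attempts=3, base_delay=0.01):
    """Call fn(), retrying transient IOErrors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except IOError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky S3 read: fails twice, then succeeds on the third call.
calls = {"n": 0}

def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient S3 error")
    return b"object contents"

result = with_retries(flaky_read)
```

Getting this right per service (which errors are retryable, how long to back off) is exactly the kind of logic it is nice to inherit from boto3 rather than reimplement in the SDK.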