Date: Thu, 25 Feb 2016 16:08:18 +0000 (UTC)
From: "Daniel Halperin (JIRA)"
To: commits@beam.incubator.apache.org
Reply-To: dev@beam.incubator.apache.org
Subject: [jira] [Created] (BEAM-60) FileBasedSource/IOChannelFactory: Custom glob expansion

Daniel Halperin created BEAM-60:
-----------------------------------

             Summary: FileBasedSource/IOChannelFactory: Custom glob expansion
                 Key: BEAM-60
                 URL: https://issues.apache.org/jira/browse/BEAM-60
             Project: Beam
          Issue Type: New Feature
          Components: sdk-java-core
            Reporter: Daniel Halperin
            Assignee: Davor Bonaci

Many cloud and distributed filesystems are eventually consistent, for instance Amazon S3 and Google Cloud Storage, so a file listing taken shortly after a write may not include every file that was written.

To work around this, many systems that produce files, such as Beam's FileBasedSink or Google BigQuery, provide a way to determine the number and set of files produced. E.g.,
* Beam FileBasedSink uses -00000-of-NNNNN
* BigQuery export jobs use -000000 -000001 -000002 ... until an empty file is produced
* Another system may produce a file with a .filelist suffix that contains a list of all files.

Users should be able to supply a glob to FileBasedSource and additionally supply a "glob expander" that provides a custom implementation of file expansion.

That way, e.g., Beam pipelines can be run back-to-back-to-back, where each consumes the output of the previous one, on an inconsistent filesystem, without data loss.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
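
As an illustration of the "glob expander" idea for the -00000-of-NNNNN case, here is a minimal sketch. The GlobExpander interface and the ShardedGlobExpander class are hypothetical names invented for this example and are not part of the Beam SDK; the sketch also assumes five-digit shard numbers. Given whatever files a possibly incomplete listing returned, it derives the full expected set from the shard count embedded in any one file name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical interface for this sketch; not an existing Beam API. */
interface GlobExpander {
  /** Maps the files a glob listing returned to the files that should be read. */
  List<String> expand(List<String> listedFiles);
}

/** Expands names of the form prefix-SSSSS-of-NNNNN[suffix] into all NNNNN shards. */
class ShardedGlobExpander implements GlobExpander {
  // Assumption: shard index and shard count are always five digits.
  private static final Pattern SHARD =
      Pattern.compile("^(.*)-(\\d{5})-of-(\\d{5})(.*)$");

  @Override
  public List<String> expand(List<String> listedFiles) {
    for (String file : listedFiles) {
      Matcher m = SHARD.matcher(file);
      if (m.matches()) {
        // One visible shard is enough to know how many shards exist in total.
        int total = Integer.parseInt(m.group(3));
        List<String> all = new ArrayList<>();
        for (int i = 0; i < total; i++) {
          all.add(String.format(
              Locale.ROOT, "%s-%05d-of-%05d%s", m.group(1), i, total, m.group(4)));
        }
        return all;
      }
    }
    // No sharded name found: fall back to the listing as returned.
    return new ArrayList<>(listedFiles);
  }
}
```

With this sketch, a listing that returned only out-00001-of-00003.txt would still expand to all three shards, so a downstream pipeline could wait for or fail on the missing ones rather than silently skipping them. A BigQuery-style or .filelist-style expander would need to probe or read files, which this sketch does not attempt.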