beam-commits mailing list archives

From "Daniel Halperin (JIRA)" <j...@apache.org>
Subject [jira] [Created] (BEAM-60) FileBasedSource/IOChannelFactory: Custom glob expansion
Date Thu, 25 Feb 2016 16:08:18 GMT
Daniel Halperin created BEAM-60:
-----------------------------------

             Summary: FileBasedSource/IOChannelFactory: Custom glob expansion
                 Key: BEAM-60
                 URL: https://issues.apache.org/jira/browse/BEAM-60
             Project: Beam
          Issue Type: New Feature
          Components: sdk-java-core
            Reporter: Daniel Halperin
            Assignee: Davor Bonaci


Many cloud and distributed filesystems are eventually consistent, for instance Amazon S3 and
Google Cloud Storage, so a glob may not match every file that was just written.

To work around this, many systems that produce files, such as Beam's FileBasedSink or Google
BigQuery, provide a way to determine the number and set of files produced. E.g.,

* Beam FileBasedSink uses -00000-of-NNNNN
* BigQuery export jobs use -000000, -000001, -000002, ... until an empty file is produced
* Another system may produce a file with a .filelist suffix that contains a list of all files.
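
As an illustration of the first convention, here is a minimal sketch of how the full set of
expected files could be derived from a single shard name, without listing the filesystem.
The class ShardNames and its method expand are hypothetical names chosen for this example;
they are not part of Beam.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Beam API): expands one shard name into the full expected set. */
final class ShardNames {
  // Matches "<prefix>-SSSSS-of-NNNNN<suffix>", e.g. "out-00000-of-00003.txt".
  private static final Pattern SHARD = Pattern.compile("^(.*)-(\\d+)-of-(\\d+)(.*)$");

  static List<String> expand(String anyShard) {
    Matcher m = SHARD.matcher(anyShard);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a sharded file name: " + anyShard);
    }
    String prefix = m.group(1);
    int width = m.group(2).length();      // zero-padding width of the shard index
    String totalText = m.group(3);        // kept verbatim to preserve its padding
    int total = Integer.parseInt(totalText);
    String suffix = m.group(4);

    List<String> files = new ArrayList<>(total);
    for (int i = 0; i < total; i++) {
      // Rebuild each name from the shard count rather than from a directory listing.
      files.add(String.format(Locale.ROOT, "%s-%0" + width + "d-of-%s%s",
          prefix, i, totalText, suffix));
    }
    return files;
  }
}
```

Because the names are computed from the shard count, the result does not depend on whether
the filesystem's listing has caught up with recent writes.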

Users should be able to supply a glob to FileBasedSource and, additionally, a "glob expander"
that provides a custom implementation of file expansion. That way, e.g., Beam pipelines
can be run back-to-back-to-back, each consuming the output of the previous one, on an
eventually consistent filesystem, without data loss.
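
A rough sketch of what such a "glob expander" might look like follows. Everything here is an
assumption made for illustration: the interface GlobExpander, the class
SequentialProbeExpander, and the size-lookup function are hypothetical and do not reflect an
actual Beam API. The example implements the BigQuery-style convention by probing numbered
files until an empty one is found, and it assumes the terminating empty file should be
excluded from the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.ToLongFunction;

/** Hypothetical interface: turns a user-supplied spec into the exact set of files to read. */
interface GlobExpander {
  List<String> expand(String spec);
}

/** Hypothetical expander: probes spec-000000, spec-000001, ... until an empty file is seen. */
final class SequentialProbeExpander implements GlobExpander {
  // Assumed lookup: returns the file size in bytes, or -1 if the file does not exist.
  private final ToLongFunction<String> sizeOf;

  SequentialProbeExpander(ToLongFunction<String> sizeOf) {
    this.sizeOf = sizeOf;
  }

  @Override
  public List<String> expand(String spec) {
    List<String> files = new ArrayList<>();
    for (int i = 0; ; i++) {
      String name = String.format(Locale.ROOT, "%s-%06d", spec, i);
      long size = sizeOf.applyAsLong(name);
      if (size < 0) {
        // No empty terminator yet: the output is incomplete, so fail instead of losing data.
        throw new IllegalStateException("Missing file before terminator: " + name);
      }
      if (size == 0) {
        return files; // the empty file marks the end of the export
      }
      files.add(name);
    }
  }
}
```

Failing loudly when a file is missing, rather than returning a partial list, is the design
choice that matters here: a silently truncated expansion is exactly the data loss the
feature is meant to prevent.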



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)