beam-commits mailing list archives

From dhalperi <...@git.apache.org>
Subject [GitHub] beam pull request #2779: [BEAM-59] Convert WriteFiles/FileBasedSink from IOC...
Date Sat, 29 Apr 2017 03:56:20 GMT
GitHub user dhalperi opened a pull request:

    https://github.com/apache/beam/pull/2779

    [BEAM-59] Convert WriteFiles/FileBasedSink from IOChannelFactory to FileSystems

    This converts FileBasedSink from IOChannelFactory to FileSystems, with
    fallout changes on all existing Transforms that use WriteFiles.
    
    We preserve the existing semantics of most transforms, simply adding the
    ability for users to provide ResourceId in addition to String when
    setting the outputPrefix.
    
    Other changes:
    
    * Make DefaultFilenamePolicy its own top-level class and move
      IOChannelUtils#constructName into it. This is the default FilenamePolicy
      used by FileBasedSink.
    
    * Rethink FilenamePolicy as a function from ResourceId (base directory)
      to ResourceId (output file), moving the base directory into the
      context. This way, FilenamePolicy logic is truly independent from the
      base directory. Using ResourceId#resolve, a filename policy can add
      multiple path components, say, base/YYYY/MM/DD/file.txt, in a
      filesystem-independent way.
    
      (Also add an optional extension parameter to the function, enabling an
      owning transform to pass in the suffix from a separately-configured
      compression factory or similar.)
    
    * Remove some old logic disallowing certain specific patterns of
      filenames that dates back to Cloud Dataflow SDKs on no-longer-used
      implementations.
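
As a rough illustration of the "function from base directory to output file" idea above, here is a minimal sketch. It is not the Beam API: it uses `java.nio.file.Path` as a stand-in for ResourceId (both offer a `resolve` method), and the `DatedPolicy` interface, the `DAILY` policy, and the `file` base name are hypothetical names invented for this example. The `extension` argument mirrors the optional extension parameter described above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDate;

public class FilenamePolicySketch {
    /**
     * Hypothetical stand-in for a filename policy: the base directory is an
     * input, so the policy itself never hard-codes where output lives.
     */
    public interface DatedPolicy {
        Path outputFile(Path baseDirectory, LocalDate date, String extension);
    }

    /**
     * Assumed example policy: appends YYYY/MM/DD one path component at a time
     * via resolve, then the file name plus an optional extension.
     */
    public static final DatedPolicy DAILY = (base, date, extension) ->
        base.resolve(String.format("%04d", date.getYear()))
            .resolve(String.format("%02d", date.getMonthValue()))
            .resolve(String.format("%02d", date.getDayOfMonth()))
            .resolve("file" + (extension == null ? "" : extension));

    public static void main(String[] args) {
        // Path stands in for ResourceId here; the separator printed depends on the platform.
        Path out = DAILY.outputFile(Paths.get("base"), LocalDate.of(2017, 4, 29), ".txt");
        System.out.println(out);
    }
}
```

Because the policy only ever calls `resolve` on whatever base it is handed, the same logic would apply unchanged to a different base directory, which is the independence the bullet above is after.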
    
    ----
    
    TODO:
    
    - [ ] I cleaned up TextIO and AvroIO, but XmlIO and TFRecordIO need more.
    - [ ] Review test coverage.
    - [ ] REALLY review testing and javadoc.
    
    But I am sending this out now so we can look at the comprehensive diff.
    
    CC: @davorbonaci @lukecwik @vikkyrk @jkff @reuvenlax 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dhalperi/beam convert-file-based-sink

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/2779.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2779
    
----
commit 1897a8756069237836a745ddaf38e9a0692db186
Author: Dan Halperin <dhalperi@google.com>
Date:   2017-04-25T17:10:28Z

    Convert WriteFiles/FileBasedSink from IOChannelFactory to FileSystems
    
    This converts FileBasedSink from IOChannelFactory to FileSystems, with
    fallout changes on all existing Transforms that use WriteFiles.
    
    We preserve the existing semantics of most transforms, simply adding the
    ability for users to provide ResourceId in addition to String when
    setting the outputPrefix.
    
    Other changes:
    
    * Make DefaultFilenamePolicy its own top-level class and move
      IOChannelUtils#constructName into it. This is the default FilenamePolicy
      used by FileBasedSink.
    
    * Rethink FilenamePolicy as a function from ResourceId (base directory)
      to ResourceId (output file), moving the base directory into the
      context. This way, FilenamePolicy logic is truly independent from the
      base directory. Using ResourceId#resolve, a filename policy can add
      multiple path components, say, base/YYYY/MM/DD/file.txt, in a
      filesystem-independent way.
    
      (Also add an optional extension parameter to the function, enabling an
      owning transform to pass in the suffix from a separately-configured
      compression factory or similar.)
    
    * Remove some old logic disallowing certain specific patterns of
      filenames that dates back to Cloud Dataflow SDKs on no-longer-used
      implementations.

----


