beam-commits mailing list archives

From jkff <...@git.apache.org>
Subject [GitHub] beam pull request #3443: [BEAM-2511] Implements TextIO.ReadAll
Date Mon, 26 Jun 2017 22:49:07 GMT
GitHub user jkff opened a pull request:

    https://github.com/apache/beam/pull/3443

    [BEAM-2511] Implements TextIO.ReadAll

    Adds a transform that reads a PCollection of filenames. Part of the plan at http://s.apache.org/textio-sdf.
Currently implemented rather naively, and without SDF (Splittable DoFn): it expands each glob, splits each
file into 64MB chunks, and reads each chunk using the existing TextReader code. This is fairly trivial,
except that the code for managing compression had to be duplicated; the duplicated path is tested by adding
a ReadAll test to every Read test.
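
    The chunking step described above can be pictured with a minimal sketch. This is not the code in
this pull request: `ChunkSplitter`, `split`, and `DEFAULT_CHUNK_BYTES` are hypothetical names used only
for illustration, and the sketch assumes an uncompressed file of known size, so it ignores the
compression handling mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Beam): splits a file of known size into byte ranges. */
final class ChunkSplitter {
    /** 64MB, the chunk size mentioned in the description. */
    static final long DEFAULT_CHUNK_BYTES = 64L * 1024 * 1024;

    private ChunkSplitter() {}

    /**
     * Returns half-open [start, end) offset pairs that together cover [0, fileSize).
     * Every range except possibly the last is exactly chunkBytes long.
     */
    static List<long[]> split(long fileSize, long chunkBytes) {
        if (fileSize < 0 || chunkBytes <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and chunkBytes > 0");
        }
        List<long[]> ranges = new ArrayList<>();
        for (long start = 0; start < fileSize; start += chunkBytes) {
            // Clamp the final range to the end of the file.
            long end = Math.min(start + chunkBytes, fileSize);
            ranges.add(new long[] {start, end});
        }
        return ranges;
    }
}
```

    Each range would then be handed to a reader independently; because nothing orders the ranges
relative to one another, this is the sense in which the chunks are unordered.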
    
    This won't advance the watermark very well because the chunks are unordered. However, in streaming
pipelines people will hopefully be ingesting PCollections of small-ish files, so this won't matter
much. And TextIO doesn't report timestamps of elements anyway, so in fact it doesn't matter at all.
One of the next steps is to also develop an SDF version of this, and have runners that support SDF
use it via an override.
    
    R: @reuvenlax 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkff/incubator-beam textio-read-all

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/3443.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3443
    
----
commit f7f0f1e4e9105a678524894a2304520541359d33
Author: Eugene Kirpichov <kirpichov@google.com>
Date:   2017-06-24T01:01:53Z

    Splits large TextIOTest into TextIOReadTest and TextIOWriteTest

commit 79ae1e8d4bbe92fad06837555db471368007bd45
Author: Eugene Kirpichov <kirpichov@google.com>
Date:   2017-06-24T01:02:10Z

    Adds TextIO.readAll(), implemented rather naively

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
