beam-commits mailing list archives

From "Chamikara Jayalath (JIRA)" <j...@apache.org>
Subject [jira] [Created] (BEAM-2643) Add TextIO.read_all() to Python SDK
Date Wed, 19 Jul 2017 18:43:00 GMT
Chamikara Jayalath created BEAM-2643:
----------------------------------------

             Summary: Add TextIO.read_all() to Python SDK
                 Key: BEAM-2643
                 URL: https://issues.apache.org/jira/browse/BEAM-2643
             Project: Beam
          Issue Type: New Feature
          Components: sdk-py
            Reporter: Chamikara Jayalath


The Java SDK now has a TextIO.readAll() API that allows reading a massive number of files by
moving from the BoundedSource API (which may perform expensive source operations on the control
plane) to ParDo operations.

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L170

This API should be added to the Python SDK as well.

This form of reading files does not support dynamic work rebalancing for now, but that should
not matter much when reading a massive number of relatively small files. In the future, this
API could support dynamic work rebalancing through Splittable DoFn.
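
To illustrate the idea described above, here is a minimal pure-Python sketch. It does not use
the Beam API. It treats file patterns as ordinary data elements and does the pattern expansion
and the reading in per-element functions, roughly the way two chained ParDo steps would. The
names expand_pattern, read_lines, read_all and demo are hypothetical and chosen only for this
example. Each file is read whole by a single call, which mirrors the lack of dynamic work
rebalancing noted above.

```python
import glob
import os
import tempfile
from typing import Iterable, Iterator, List


def expand_pattern(pattern: str) -> Iterator[str]:
    """Per-element step 1 (hypothetical): turn one file pattern into file paths."""
    # sorted() only makes the output order deterministic for this sketch.
    yield from sorted(glob.glob(pattern))


def read_lines(path: str) -> Iterator[str]:
    """Per-element step 2 (hypothetical): read one whole file, line by line."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            # Strip the trailing newline so callers get bare lines.
            yield line.rstrip("\n")


def read_all(patterns: Iterable[str]) -> Iterator[str]:
    """Chain the two steps; the patterns are input data, not pipeline configuration."""
    for pattern in patterns:
        for path in expand_pattern(pattern):
            # No splitting within a file: one call reads the entire file.
            yield from read_lines(path)


def demo() -> List[str]:
    """Write two small files to a temporary directory and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, text in [("a.txt", "one\ntwo\n"), ("b.txt", "three\n")]:
            with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
                handle.write(text)
        return list(read_all([os.path.join(tmp, "*.txt")]))
```

In a real pipeline the two functions would presumably become DoFns applied to a PCollection of
patterns, so that the expensive expansion runs on workers rather than on the control plane.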

cc: [~jkff]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
