beam-commits mailing list archives

From "Christopher Hebert (JIRA)" <j...@apache.org>
Subject [jira] [Created] (BEAM-2586) Accommodate custom delimiters in TextIO
Date Tue, 11 Jul 2017 18:31:00 GMT
Christopher Hebert created BEAM-2586:
----------------------------------------

             Summary: Accommodate custom delimiters in TextIO
                 Key: BEAM-2586
                 URL: https://issues.apache.org/jira/browse/BEAM-2586
             Project: Beam
          Issue Type: New Feature
          Components: sdk-java-core
            Reporter: Christopher Hebert
            Assignee: Davor Bonaci
            Priority: Minor


We frequently process text files whose records are delimited by something other than newlines,
including files delimited only by end of file (that is, one record per file).

First option:
When we want to delimit by commas (or something else), we could use TextIO to read in line
by line and apply a transform to split each line on commas. When we want to delimit by whole
file, we could combine the elements of the PCollection output from TextIO that come from the
same file into one element.

Second option:
Instead of complicating (and slowing) our pipelines with the methods above, we could
write a custom FileBasedSource for each use case.

Third option:
Preferably, we'd like to generalize TextIO to accept delimiters other than the defaults
(\n, \r, and \r\n).
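
As a rough illustration of the record-splitting behavior being requested, here is a minimal
plain-Java sketch. It is not Beam code and not the attached pull request: the class
DelimitedRecords and its split method are hypothetical names, and it works on an in-memory
String rather than on bytes inside a splittable FileBasedSource. It also assumes that an empty
delimiter means "delimit only by end of file".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Beam): splits text into records on a custom delimiter. */
final class DelimitedRecords {
    private DelimitedRecords() {}

    /**
     * Splits text on every occurrence of delimiter.
     * Assumption: a null or empty delimiter means "delimit only by end of input",
     * so the whole text is returned as a single record.
     */
    static List<String> split(String text, String delimiter) {
        List<String> records = new ArrayList<>();
        if (delimiter == null || delimiter.isEmpty()) {
            records.add(text);
            return records;
        }
        int start = 0;
        int hit;
        while ((hit = text.indexOf(delimiter, start)) >= 0) {
            // Emit the record that ends at this delimiter, then skip past the delimiter.
            records.add(text.substring(start, hit));
            start = hit + delimiter.length();
        }
        // Keep a trailing record that is not terminated by a delimiter.
        if (start < text.length()) {
            records.add(text.substring(start));
        }
        return records;
    }
}
```

With a comma delimiter, the input "a,b\nc,d" yields the three records "a", "b\nc", and "d": the
newline stays inside the second record. Reading line by line first and then splitting on commas
(the first option) would instead produce four records, which is one reason a delimiter-aware
reader is preferable.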

I'll attach a pull request showing how we envision this generalization of TextIO.

If this is not the direction Beam would like to go with TextIO, then we'll stick to maintaining
our own TextIO or our own FileBasedSources to achieve this functionality.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)