beam-commits mailing list archives

From "Daniel Halperin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (BEAM-73) IO design pattern: Decouple Parsers and Coders
Date Thu, 20 Oct 2016 18:24:58 GMT

     [ https://issues.apache.org/jira/browse/BEAM-73?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Halperin updated BEAM-73:
--------------------------------
    Labels: backward-incompatible  (was: )

> IO design pattern: Decouple Parsers and Coders
> ----------------------------------------------
>
>                 Key: BEAM-73
>                 URL: https://issues.apache.org/jira/browse/BEAM-73
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core
>            Reporter: Daniel Halperin
>            Priority: Minor
>              Labels: backward-incompatible
>
> Many Sources can be thought of as providing a byte[] payload, e.g. the bytes between
> newlines in TextIO, or PubSubIO messages. Therefore, we originally suggested using a Coder
> to decode each byte[] into a T (what I'll call parsing).
> Consider the case of a text file of integers.
> 123\n
> 456\n
> ...
> We want a PCollection<Integer> out, so we can use TextualIntegerCoder with TextIO.Read.
> However, that Coder will get propagated as the default coder for that PCollection (and may
> be used in downstream DoFns). This seems bad because, once the data is parsed, we probably
> want to use VarIntCoder or another Coder that is more CPU- and space-efficient.
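
To make the space argument concrete, here is a minimal sketch that compares the number of bytes a decimal text encoding needs with the number a generic base-128 variable-length encoding needs. The class and method names are hypothetical and the varint scheme is the common 7-bits-per-byte layout; it is meant to illustrate the kind of saving at stake, not to reproduce the exact wire format of any Beam Coder.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: compares textual and varint encoded sizes of an int. */
final class EncodingSizeSketch {

  /** Number of bytes needed to store the value as decimal text (no delimiter). */
  static int textualSize(int value) {
    return Integer.toString(value).getBytes(StandardCharsets.UTF_8).length;
  }

  /**
   * Number of bytes needed by a base-128 varint, which stores 7 payload bits
   * per byte. The value is treated as an unsigned 32-bit quantity, so
   * negative ints take the maximum of 5 bytes.
   */
  static int varIntSize(int value) {
    int size = 1;
    int remaining = value >>> 7;
    while (remaining != 0) {
      size++;
      remaining >>>= 7;
    }
    return size;
  }

  public static void main(String[] args) {
    for (int v : new int[] {7, 123, 456, 123456789}) {
      System.out.println(v + ": text=" + textualSize(v) + " varint=" + varIntSize(v));
    }
  }
}
```

For the sample file above, 123 takes three bytes as text and one byte as a varint, and the gap widens as the numbers grow.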
> Another design pattern is
>     TextIO.Read() -> MapElements<String, Integer> (lambda s : Integer.parseInt(s))
> This behaves better, because the parsed PCollection no longer inherits the text-based
> Coder, but now we go from byte[] to String to Integer rather than directly from byte[]
> to Integer.
> The solution seems to be to explicitly add Parser and Coder abstractions.
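
One way such a split could look is sketched below. The Parser interface and the AsciiIntParser class are hypothetical names invented for illustration, not existing Beam APIs. The parser goes directly from byte[] to Integer with no intermediate String, and it says nothing about how the resulting Integers are encoded downstream, so that choice stays with whichever Coder the PCollection uses.

```java
/** Hypothetical abstraction: turns one raw source payload into a T. */
interface Parser<T> {
  T parse(byte[] payload);
}

/** Parses ASCII decimal digits straight from the payload bytes. */
final class AsciiIntParser implements Parser<Integer> {
  @Override
  public Integer parse(byte[] payload) {
    if (payload.length == 0) {
      throw new NumberFormatException("empty payload");
    }
    boolean negative = payload[0] == '-';
    int i = negative ? 1 : 0;
    if (i == payload.length) {
      throw new NumberFormatException("no digits");
    }
    // Accumulate in a long so overflow past the int range can be detected.
    long value = 0;
    for (; i < payload.length; i++) {
      int digit = payload[i] - '0';
      if (digit < 0 || digit > 9) {
        throw new NumberFormatException("non-digit byte at index " + i);
      }
      value = value * 10 + digit;
      if (value > (long) Integer.MAX_VALUE + 1) {
        throw new NumberFormatException("out of int range");
      }
    }
    if (negative) {
      value = -value;
    }
    if (value > Integer.MAX_VALUE) {
      throw new NumberFormatException("out of int range");
    }
    return (int) value;
  }
}
```

Under this sketch, a source would be configured with a Parser for reading and, independently, the PCollection would carry whatever Coder suits its element type.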



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
