beam-commits mailing list archives

From "Eugene Kirpichov (JIRA)" <>
Subject [jira] [Updated] (BEAM-2776) TextIO should support reading header lines
Date Thu, 25 Jan 2018 22:28:00 GMT


Eugene Kirpichov updated BEAM-2776:
    Priority: Minor  (was: Major)

> TextIO should support reading header lines
> ------------------------------------------
>                 Key: BEAM-2776
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core, sdk-py-core
>            Reporter: Eugene Kirpichov
>            Priority: Minor
> Users frequently request the ability to skip some header rows when reading text files.
> This is also relevant for reading file formats such as VCF, see thread
> Python supports this partially via skip_header_lines, but the header lines can have
> useful content, and the number of header lines is not fixed (in VCF).
> We should figure out a good API for this and support this natively in TextIO. The API
> decisions would be:
> - How do we specify how much of the beginning of each file is the header: options could
>   be e.g. a certain number of lines; or lines that start with a certain character; or a custom
> - How do we make the header contents accessible to a user of TextIO. Since the header
>   can be different in each file, we can't return it as a PCollectionView<List<String>>.
>   Instead I suppose, when you use a header, you'd need to specify a
>   SerializableFunction<KV<List<String>, String>, T> or something like that for parsing
>   (header, line) -> user type. Note that currently TextIO.Read does not support returning
>   a user type anyway, so that'd need to be done too.
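
The two API decisions above can be illustrated with a minimal sketch in plain Python. This is not Beam code and not a proposed TextIO API: the names split_header, read_with_header, first_n and parse_record are hypothetical and exist only for this illustration. It assumes each file is read sequentially from its first line and ignores how a real reader would handle files split into byte ranges.

```python
import itertools
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def split_header(
    lines: Iterable[str], is_header: Callable[[str], bool]
) -> Tuple[List[str], Iterator[str]]:
    """Collect leading lines while is_header(line) is true; return (header, rest)."""
    it = iter(lines)
    header: List[str] = []
    for line in it:
        if is_header(line):
            header.append(line)
        else:
            # First non-header line: put it back in front of the remaining lines.
            return header, itertools.chain([line], it)
    # Every line was a header line (or the input was empty).
    return header, iter(())


def first_n(n: int) -> Callable[[str], bool]:
    """Hypothetical header rule: the first n lines of the file are the header."""
    counter = itertools.count()
    return lambda _line: next(counter) < n


def read_with_header(
    lines: Iterable[str],
    is_header: Callable[[str], bool],
    parse: Callable[[List[str], str], T],
) -> Iterator[T]:
    """Apply parse(header, line) to every non-header line of one file."""
    header, body = split_header(lines, is_header)
    for line in body:
        yield parse(header, line)


def parse_record(header: List[str], line: str) -> dict:
    """Use the last header line as column names for a tab-separated record."""
    columns = header[-1].lstrip("#").split("\t")
    return dict(zip(columns, line.split("\t")))


# A VCF-like file whose header is "all leading lines that start with '#'".
vcf_like = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID",
    "chr1\t100\trs1",
    "chr2\t200\trs2",
]
records = list(
    read_with_header(vcf_like, lambda l: l.startswith("#"), parse_record)
)

# A CSV-like file whose header is a fixed number of lines.
csv_like = ["name,age", "ada,36"]
pairs = list(
    read_with_header(csv_like, first_n(1), lambda header, line: (header, line))
)
```

Passing the header to the parse function for every line mirrors the (header, line) -> user type shape suggested in the description, and it keeps per-file headers separate without needing a side input.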

This message was sent by Atlassian JIRA
