beam-commits mailing list archives

From "Daniel Halperin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (BEAM-51) Implement a CSV file reader
Date Thu, 25 Feb 2016 16:14:18 GMT

     [ https://issues.apache.org/jira/browse/BEAM-51?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Halperin updated BEAM-51:
--------------------------------
    Assignee:     (was: James Malone)

> Implement a CSV file reader
> ---------------------------
>
>                 Key: BEAM-51
>                 URL: https://issues.apache.org/jira/browse/BEAM-51
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>            Reporter: Daniel Halperin
>            Priority: Minor
>
> We should implement a CSV-based source.
> One possibility would be to support the same options as BigQuery: https://cloud.google.com/bigquery/preparing-data-for-bigquery#dataformats
> These options are:
> fieldDelimiter: allows a custom delimiter (CSV vs. TSV, etc.). My guess is this is critical.
> One common delimiter that people use is 'thorn' (þ).
> quote: custom quote character. By default this is '"', but this allows users to set it to
> something else or, perhaps more commonly, to remove it entirely (by setting it to the empty
> string). For example, tab-separated files generally don't need quotes.
> allowQuotedNewlines: whether newlines may appear inside quoted fields. In the official CSV RFC,
> newlines can be quoted; that is, "a", "b\n", "c" can form a single record. This makes splitting
> large CSV files impossible, because a newline may fall inside a field rather than end a record,
> so we should disallow quoted newlines by default unless the user really wants them (in which
> case, they'll get worse performance).
> allowJaggedRows: infers null for the missing columns when a row specifies too few values.
> Otherwise we give an error for the row.
> ignoreUnknownValues: the opposite of allowJaggedRows. If a row has _too_ many values for the
> schema, we ignore the ones we don't recognize rather than reporting an error for the row.
> skipHeaderRows: how many header lines are in the file.
> encoding: UTF-8 vs. Latin-1, etc.
> compression: gzip, bzip, etc.
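
To make the row-level options above concrete, here is a minimal sketch in Python built on the standard csv module. It is not Beam code and not a proposed API: the CsvOptions class, the parse_rows function, and the snake_case field names are hypothetical, chosen only to mirror the option names in the issue. Padding missing values at the end of the row and dropping extra values from the end are also assumptions. Encoding, compression, and the allowQuotedNewlines switch are left out.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class CsvOptions:
    """Hypothetical option bundle mirroring the names used in the issue."""
    field_delimiter: str = ","
    quote: str = '"'               # empty string disables quoting entirely
    allow_jagged_rows: bool = False
    ignore_unknown_values: bool = False
    skip_header_rows: int = 0


def parse_rows(text, num_columns, opts):
    """Parse CSV text into rows of exactly num_columns fields."""
    kwargs = {"delimiter": opts.field_delimiter}
    if opts.quote:
        kwargs["quotechar"] = opts.quote
    else:
        # With QUOTE_NONE the reader treats quote characters as ordinary data.
        kwargs["quoting"] = csv.QUOTE_NONE

    reader = csv.reader(io.StringIO(text, newline=""), **kwargs)
    rows = []
    for index, fields in enumerate(reader):
        if index < opts.skip_header_rows:
            continue  # header records are dropped, not validated
        if len(fields) < num_columns:
            if not opts.allow_jagged_rows:
                raise ValueError(f"record {index}: too few values")
            # Assumption: missing values are padded with None at the end.
            fields = fields + [None] * (num_columns - len(fields))
        elif len(fields) > num_columns:
            if not opts.ignore_unknown_values:
                raise ValueError(f"record {index}: too many values")
            # Assumption: extra values are dropped from the end.
            fields = fields[:num_columns]
        rows.append(fields)
    return rows


# A quoted newline yields one record that spans two physical lines,
# which is why a file cannot be split safely at arbitrary newlines.
record = parse_rows('"a","b\nc","d"\n', 3, CsvOptions())
```

In this sketch, parse_rows("1þ2þ3þ4\n", 3, CsvOptions(field_delimiter="þ", ignore_unknown_values=True)) keeps only the first three values, and the same input without ignore_unknown_values raises ValueError.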



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
