beam-commits mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-48) BigQueryIO.Read reimplemented as BoundedSource
Date Tue, 02 May 2017 17:39:04 GMT

    [ https://issues.apache.org/jira/browse/BEAM-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15993368#comment-15993368 ]

ASF GitHub Bot commented on BEAM-48:
------------------------------------

GitHub user dhalperi opened a pull request:

    https://github.com/apache/beam/pull/2832

    [BEAM-48] BigQuery: swap from asSingleton to asIterable for Cleanup

    asIterable can be simpler for runners to implement, as it does not semantically require
    that the PCollection being viewed contain exactly one element.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dhalperi/beam bq-singleton

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/2832.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2832
    
----

> BigQueryIO.Read reimplemented as BoundedSource
> ----------------------------------------------
>
>                 Key: BEAM-48
>                 URL: https://issues.apache.org/jira/browse/BEAM-48
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-gcp
>            Reporter: Daniel Halperin
>            Assignee: Pei He
>             Fix For: 0.1.0-incubating
>
>
> BigQueryIO.Read is currently implemented in a hacky way: the DirectPipelineRunner streams
> all rows in the table or query result directly using the JSON API, in a single-threaded manner.
> In contrast, the DataflowPipelineRunner uses an entirely different code path, implemented
> in the Google Cloud Dataflow service (a BigQuery export job to GCS, followed by a parallel
> read from GCS).
> We need to reimplement BigQueryIO as a BoundedSource in order to support other runners
> in a scalable way.
> I additionally suggest that we revisit the design of the BigQueryIO source in the process.
> A short list:
> * Do not use TableRow as the default value for rows. It could be Map<String, Object>
> with well-defined types, for example, or an Avro GenericRecord. Dropping TableRow will get
> around a variety of issues with types, fields named 'f', etc., and it will also reduce
> confusion, as we use TableRow objects differently than usual (for good reason).
> * We could also directly add support for a RowParser to a user's POJO.
> * We should expose TableSchema as a side output from BigQueryIO.Read.
> * Our builders for BigQueryIO.Read are useful and we should keep them. Where possible,
> we should also allow users to provide the JSON objects that configure the underlying
> intermediate tables, query export, etc. This would let users directly control result
> flattening, the location of intermediate tables, table decorators, etc., and also
> optimistically let users take advantage of some new BigQuery features without code changes.
> * We could switch between a BigQuery export + parallel scan and an API read based on
> factors such as the size of the table at pipeline construction time.
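
The BoundedSource contract the issue asks for can be sketched in plain, self-contained
Java. The class name and methods below are hypothetical stand-ins (not the actual Beam
API): a bounded source reports an estimated size, splits itself into sub-sources, and
each sub-source reads only its own range, which is what lets any runner parallelize the
read instead of streaming rows single-threaded.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the bounded-source contract; not the Beam API.
    public class SimpleBoundedSource {
        private final int[] rows;          // stand-in for a table's rows
        private final int start, end;      // half-open range this source covers

        public SimpleBoundedSource(int[] rows) { this(rows, 0, rows.length); }

        private SimpleBoundedSource(int[] rows, int start, int end) {
            this.rows = rows; this.start = start; this.end = end;
        }

        // Runners use the estimated size to decide how many splits to request.
        public long getEstimatedSizeBytes() {
            return (long) (end - start) * Integer.BYTES;
        }

        // Split into roughly equal sub-sources, one per parallel bundle.
        public List<SimpleBoundedSource> split(int desiredBundles) {
            List<SimpleBoundedSource> bundles = new ArrayList<>();
            int span = Math.max(1, (end - start + desiredBundles - 1) / desiredBundles);
            for (int s = start; s < end; s += span) {
                bundles.add(new SimpleBoundedSource(rows, s, Math.min(s + span, end)));
            }
            return bundles;
        }

        // Each sub-source reads only its own range, independently of its siblings.
        public List<Integer> read() {
            List<Integer> out = new ArrayList<>();
            for (int i = start; i < end; i++) out.add(rows[i]);
            return out;
        }
    }
    ```

Under this contract, the export-to-GCS path and the direct API path become two
interchangeable implementations of the same split/read interface, which is what
makes the size-based switching suggested above possible.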



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
