beam-commits mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (BEAM-2700) BigQueryIO should support using file load jobs when using unbounded collections
Date Sun, 30 Jul 2017 18:49:00 GMT


ASF GitHub Bot commented on BEAM-2700:

GitHub user reuvenlax opened a pull request:

    [BEAM-2700] Support load jobs in streaming

    Allow BigQuery load jobs to be selected by the user even when using unbounded PCollections.
If using unbounded PCollections, the user must specify a frequency indicating how often these
load jobs will be generated.
    Note: while there are some similarities between the BigQuery transform and what is done
in FileBasedSink, there are enough differences that it does not appear easy or advisable
to attempt to reuse the code.
    Note: a design choice is to only allow the user to specify a triggering frequency, not
arbitrary windows. The reason is that this triggering frequency is merely a tuning parameter
controlling the BigQuery load jobs and is usually set to keep the number of BQ load jobs under
quota (ideally it wouldn't even be needed; however, I don't know how to make this automatic
and respect user quotas). There is no need for semantic windowing to control how often these
writes happen.
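
The selection rule described above can be sketched roughly as follows. This is only a minimal, self-contained model of the behavior, not the code in the pull request: the class `InsertMethodChooser`, the `Method` enum, and the `resolve` helper are hypothetical names introduced for illustration, and the fallback when no method is requested is assumed to be the existing behavior (load jobs for bounded input, streaming inserts for unbounded input).

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical model of the selection rule; not Beam's actual implementation. */
final class InsertMethodChooser {

    /** The two ways of writing to BigQuery discussed in this message. */
    enum Method { FILE_LOADS, STREAMING_INSERTS }

    /**
     * Resolves the insertion method for a write.
     *
     * With no explicit request, this falls back to choosing by input type:
     * bounded input uses load jobs, unbounded input uses streaming inserts.
     */
    static Method resolve(Optional<Method> requested,
                          boolean unboundedInput,
                          Optional<Duration> triggeringFrequency) {
        Method method = requested.orElse(
            unboundedInput ? Method.STREAMING_INSERTS : Method.FILE_LOADS);

        // Load jobs on an unbounded input need a frequency saying how often to issue them.
        if (method == Method.FILE_LOADS && unboundedInput && !triggeringFrequency.isPresent()) {
            throw new IllegalArgumentException(
                "A triggering frequency is required for file loads on unbounded input");
        }
        return method;
    }

    public static void main(String[] args) {
        // Unbounded input, no explicit choice: falls back to the default for that input.
        System.out.println(resolve(Optional.empty(), true, Optional.empty()));

        // Unbounded input, load jobs requested with a frequency (10 minutes is an arbitrary value).
        System.out.println(resolve(Optional.of(Method.FILE_LOADS), true,
            Optional.of(Duration.ofMinutes(10))));
    }
}
```

In this sketch the frequency is a plain duration rather than a window, which mirrors the design choice above: it only tunes how often load jobs are issued and carries no windowing semantics.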

You can merge this pull request into a Git repository by running:

    $ git pull bq_load_jobs_in_streaming

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3662

commit 83fccf0cecb2b5eff1d4b814597c85256f2773f0
Author: Reuven Lax <>
Date:   2017-07-30T18:17:39Z

    Allow users to choose the BigQuery insertion method. If choosing file load jobs on an
unbounded PCollection, a triggering frequency must be specified to control how often load
jobs are generated.

commit 128984b00bb42782767ee34c74f3c6b234b83d93
Author: Reuven Lax <>
Date:   2017-07-30T18:36:12Z



> BigQueryIO should support using file load jobs when using unbounded collections
> -------------------------------------------------------------------------------
>                 Key: BEAM-2700
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-gcp
>    Affects Versions: 2.2.0
>            Reporter: Reuven Lax
>            Assignee: Reuven Lax
> Currently the method used for inserting into BigQuery is based on the input PCollection.
Bounded input uses file load jobs; unbounded input uses streaming inserts. However, while
streaming inserts have far lower latency, they cost quite a bit more and they provide weaker
consistency guarantees. Users should be able to choose which method to use, irrespective of
the input PCollection.

This message was sent by Atlassian JIRA
