beam-commits mailing list archives

From "Vincent Spiewak (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (BEAM-2840) BigQueryIO write is slow/fail with a bounded source
Date Thu, 07 Sep 2017 08:48:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156680#comment-16156680 ]

Vincent Spiewak edited comment on BEAM-2840 at 9/7/17 8:47 AM:
---------------------------------------------------------------

I use the DataflowRunner with 1 to 5 n1-standard-4 instances (4 vCPUs, 15 GB RAM each).

I read from BigQuery, apply some transforms, then write the output to BigQuery (dated tables) and Elasticsearch 5, roughly as in the sketch below.
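
A minimal sketch of a pipeline with that shape on the Beam 2.0.0 Java SDK. The project/dataset/table names, the identity transform, and the dateSuffix helper are placeholders, not the actual job, and the Elasticsearch 5 sink is omitted; it only illustrates a bounded BigQuery read feeding a BigQueryIO write to dated tables.

{code:java}
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

public class BoundedBqToDatedTables {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Bounded input: a BigQuery query result (~294 GB / ~740M rows in this report).
    PCollection<TableRow> rows = p.apply("ReadFromBQ",
        BigQueryIO.read()
            .fromQuery("SELECT * FROM `my-project.my_dataset.events`")
            .usingStandardSql());

    // "do some transforms" -- stand-in identity transform.
    PCollection<TableRow> transformed = rows.apply("Transform",
        MapElements.via(new SimpleFunction<TableRow, TableRow>() {
          @Override
          public TableRow apply(TableRow row) {
            return row;
          }
        }));

    // Write to dated tables. Because the input is bounded, BigQueryIO defaults
    // to Method.FILE_LOADS here, which is the code path this issue is about.
    transformed.apply("WriteToBQ", BigQueryIO.writeTableRows()
        .to(value -> new TableDestination(
            "my-project:my_dataset.events_" + dateSuffix(value.getValue()), null))
        .withCreateDisposition(CreateDisposition.CREATE_NEVER) // dated tables assumed to exist
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    p.run();
  }

  // Hypothetical helper: derive a yyyyMMdd table suffix from a row field.
  private static String dateSuffix(TableRow row) {
    return (String) row.get("event_date");
  }
}
{code}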


was (Author: vspiewak):
I use DataflowRunner with 1 to 5 n1-standard-4 instances (4 vCPU, 15 GB Ram)

> BigQueryIO write is slow/fail with a bounded source
> ---------------------------------------------------
>
>                 Key: BEAM-2840
>                 URL: https://issues.apache.org/jira/browse/BEAM-2840
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-gcp
>    Affects Versions: 2.0.0
>         Environment: Google Cloud Platform
>            Reporter: Vincent Spiewak
>            Assignee: Reuven Lax
>         Attachments: PrepareWrite.BatchLoads.png
>
>
> BigQueryIO Writer is slow / fails if the input source is bounded.
> EDIT: Input BQ: 294 GB, 741,896,827 events
> If the input source is bounded (GCS / BQ select / ...), BigQueryIO Writer uses "[Method.FILE_LOADS|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1168]" instead of streaming inserts.
> Large amounts of input data result in a java.lang.OutOfMemoryError: Java heap space (~500 million rows).
> !PrepareWrite.BatchLoads.png|thumbnail!
> We cannot use "Method.STREAMING_INSERTS" or control the batch sizes since [withMaxFilesPerBundle|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1131] is private :(
> Someone reported a similar problem with GCS -> BQ on Stack Overflow: 
> [Why is writing to BigQuery from a Dataflow/Beam pipeline slow?|https://stackoverflow.com/questions/45889992/why-is-writing-to-bigquery-from-a-dataflow-beam-pipeline-slow#comment78954153_45889992]
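
For reference, the selection the linked Method.FILE_LOADS line is part of can be paraphrased as below. This is a simplified restating of the behaviour described in the report (bounded input routed to load jobs, unbounded input routed to streaming inserts when no method is forced), not the actual Beam source.

{code:java}
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

// Simplified paraphrase (not the actual Beam code) of the default write-method
// selection when the caller has not forced a method.
class DefaultWriteMethod {
  static BigQueryIO.Write.Method defaultMethodFor(PCollection<?> input) {
    // Bounded inputs (GCS files, a BQ select, ...) go through batch load jobs,
    // the FILE_LOADS / OutOfMemoryError path reported above; unbounded inputs
    // go through streaming inserts (tabledata.insertAll).
    return input.isBounded() == PCollection.IsBounded.BOUNDED
        ? BigQueryIO.Write.Method.FILE_LOADS
        : BigQueryIO.Write.Method.STREAMING_INSERTS;
  }
}
{code}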



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
