beam-user mailing list archives

From Andrew Jones <andrew+b...@andrew-jones.com>
Subject WriteTOBigQuery/BatchLoads/ReifyResults step taking hours
Date Sat, 03 Mar 2018 11:30:24 GMT
Hi,

We have a Dataflow job that loads data from GCS, does a bit of transformation, then writes
to a number of BigQuery tables using DynamicDestinations.

The same job runs fine on smaller data sets (~70 million records), but this one is struggling when
processing ~500 million records. Both jobs write to the same number of tables; the
only difference is the number of records.

Example job IDs include 2018-03-02_04_29_44-2181786949469858712 and 2018-03-02_08_46_28-4580218739500768796.
They are using BigQueryIO to write to BigQuery, using the BigQueryIO.Write.Method.FILE_LOADS
method (the default for a bounded job). They successfully stage all their data to GCS, but
then for some reason scale down to a single worker when processing the step WriteTOBigQuery/BatchLoads/ReifyResults
and stay in that step for hours.

In the logs we see many entries like this:

Proposing dynamic split of work unit ...-7e07;2018-03-02_04_29_44-2181786949469858712;662185752552586455
at {"fractionConsumed":0.5}
Rejecting split request because custom reader returned null residual source.

And also occasionally this:

Processing lull for PT24900.038S in state process of WriteTOBigQuery/BatchLoads/ReifyResults/ParDo(Anonymous)
  at java.net.SocketInputStream.socketRead0(Native Method)
  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
  at java.net.SocketInputStream.read(SocketInputStream.java:170)
  at java.net.SocketInputStream.read(SocketInputStream.java:141)
  ...
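
For scale, the PT24900.038S in that message is an ISO 8601 duration of just under seven hours. A minimal Python sketch for converting it, assuming the lull message always uses the PT<seconds>S form seen above (the helper name parse_lull_seconds is hypothetical and not part of Beam or Dataflow):

```python
import re
from datetime import timedelta


def parse_lull_seconds(log_line):
    """Return the lull length in seconds, or None if the line has no lull message.

    Assumption: the duration is always written as PT<seconds>S, as in the
    log entry quoted above; other ISO 8601 forms are not handled.
    """
    match = re.search(r"Processing lull for PT(\d+(?:\.\d+)?)S", log_line)
    return float(match.group(1)) if match else None


line = (
    "Processing lull for PT24900.038S in state process of "
    "WriteTOBigQuery/BatchLoads/ReifyResults/ParDo(Anonymous)"
)

seconds = parse_lull_seconds(line)
lull = timedelta(seconds=seconds)

# Break the whole seconds down into hours and minutes.
hours, remainder = divmod(int(lull.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours}h {minutes}m")  # prints "6h 55m"
```

So this single ParDo call had been blocked on a socket read for roughly 6 hours 55 minutes when the message was logged.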

The job does eventually progress, but only after many hours. It then fails later with this
error, which may or may not be related (we are just starting to look into it):

(94794e1a2c96f380): java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException:
java.io.IOException: Unable to patch table description: {datasetId=..., projectId=..., tableId=9c20908cc6e549b4a1e116af54bb8128_011249028ddcc5204885bff04ce2a725_00001_00000},
aborting after 9 retries.

We're not sure how to proceed, so any pointers would be appreciated.

Thanks,
Andrew
