beam-commits mailing list archives

From "Reuven Lax (JIRA)" <>
Subject [jira] [Commented] (BEAM-2858) temp file garbage collection in BigQuery sink should be in a separate DoFn
Date Tue, 12 Sep 2017 19:36:00 GMT


Reuven Lax commented on BEAM-2858:

I just reproduced this and verified it does not cause data loss. The load job fails (on query)
with a 409. The message is:

Error encountered during job execution:
Not found: URI gs://bigquery_beam_testing_regional/temp/BigQueryWriteTemp/c7fb6a3d06fa4ceab662f83488cc6d31/c5db57f8-9cc0-4cad-9a7e-9c56cb572177

However, this is still a critical bug. Streaming jobs get blocked forever, because the job
fails on every retry. Batch jobs will retry this several times and then fail.
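The failure mode above can be shown with a minimal, stdlib-only simulation (this is not Beam or BigQuery code; the class, the in-memory `gcs` set, and the `load`/`loadAndDelete` methods are all hypothetical stand-ins): once the step that runs the load also deletes its inputs, the first retry after a deletion sees missing files and fails, and so does every retry after it.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of the retry behaviour described above.
public class RetryDeleteDemo {
    // Stand-in for GCS: the set of temp files that still exist.
    static Set<String> gcs = new HashSet<>(List.of("temp/a", "temp/b"));

    // Stand-in for the BQ load: fails (like the "Not found" error above)
    // if any input file has already been deleted.
    static void load(List<String> inputs) {
        for (String f : inputs) {
            if (!gcs.contains(f)) {
                throw new IllegalStateException("Not found: " + f);
            }
        }
    }

    // Buggy pattern: load and delete inputs in the same retryable step.
    // Returns true on success, false on failure.
    static boolean loadAndDelete(List<String> inputs) {
        try {
            load(inputs);
            gcs.removeAll(inputs); // a crash near here triggers a retry
            return true;
        } catch (IllegalStateException e) {
            return false; // every retry lands here once files are gone
        }
    }

    public static void main(String[] args) {
        List<String> inputs = List.of("temp/a", "temp/b");
        System.out.println(loadAndDelete(inputs)); // first attempt: true
        System.out.println(loadAndDelete(inputs)); // retry: false, forever
    }
}
```

In a streaming runner that retries indefinitely, the second call repeats forever; in a batch runner, it repeats until the retry budget is exhausted and the whole job fails.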

> temp file garbage collection in BigQuery sink should be in a separate DoFn
> --------------------------------------------------------------------------
>                 Key: BEAM-2858
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-gcp
>    Affects Versions: 2.1.0
>            Reporter: Reuven Lax
>            Assignee: Reuven Lax
>             Fix For: 2.2.0
>         Attachments: delete_file_diff.txt
> Currently the WriteTables transform deletes the set of input files as soon as the load()
> job completes. However, this is incorrect: if the task fails partway through deleting files
> (e.g. if the worker crashes), the task will be retried. If the write disposition is WRITE_TRUNCATE,
> bad things could result.
> The resulting behavior will depend on what BQ does if one of the input files is missing (because
> we had previously deleted it). In the best case, BQ will fail the load. In this case the step
> will keep failing until the runner finally fails the entire job. If however BQ ignores the
> missing file, the load will overwrite the previously-written table with the smaller set of
> files and the job will succeed. This is the worst-case scenario, as it will result in data
> loss.
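The fix direction named in the issue title, moving garbage collection into a separate DoFn, can be sketched with the same stdlib-only simulation (again, hypothetical names, not the actual Beam WriteTables code): the load step never touches its inputs, so it can be retried safely, and the cleanup step is idempotent, so retrying it is harmless too.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of splitting load and cleanup into separate steps.
public class SeparateCleanupDemo {
    // Stand-in for GCS: the set of temp files that still exist.
    static Set<String> gcs = new HashSet<>(List.of("temp/a", "temp/b"));

    // Stand-in for the BQ load: fails if any input file is missing.
    static void load(List<String> inputs) {
        for (String f : inputs) {
            if (!gcs.contains(f)) {
                throw new IllegalStateException("Not found: " + f);
            }
        }
    }

    // Retryable load step: never deletes its inputs, so it can be
    // retried any number of times without changing the outcome.
    static void loadStep(List<String> inputs) {
        load(inputs);
    }

    // Separate cleanup step, downstream of the load: removeAll is a
    // no-op for already-deleted files, so retrying it is safe.
    static void cleanupStep(List<String> inputs) {
        gcs.removeAll(inputs);
    }

    public static void main(String[] args) {
        List<String> inputs = List.of("temp/a", "temp/b");
        loadStep(inputs);    // attempt 1 succeeds
        loadStep(inputs);    // a retry still succeeds: files are intact
        cleanupStep(inputs); // runs only after the load result is final
        cleanupStep(inputs); // retrying cleanup is harmless
        System.out.println(gcs.isEmpty()); // true
    }
}
```

The key design point is ordering: deletion happens only after the load's result has been committed downstream, so no retry of the load can observe a partially deleted input set.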

This message was sent by Atlassian JIRA
