beam-commits mailing list archives

From "Matti Remes (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-2768) Fix bigquery.WriteTables generating non-unique job identifiers
Date Tue, 15 Aug 2017 20:41:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127852#comment-16127852
] 

Matti Remes commented on BEAM-2768:
-----------------------------------

{code:java}
public static void loadRowsToBigQuery(String name, PCollection<TableRow> rows,
        DynamicDestinations<TableRow, String> destination) {
    rows.apply(name, BigQueryIO.<TableRow>write()
            .withFormatFunction(new TableRowFormatter())
            .to(destination)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
}

// Identity formatter: the input elements are already TableRow instances.
public class TableRowFormatter implements SerializableFunction<TableRow, TableRow> {
    @Override
    public TableRow apply(TableRow tableRow) {
        return tableRow;
    }
}
{code}

Apologies for the mixed-up references; yes, I was intending to point to the 2.0.0 source (I'm
using 2.0.0).

The problem might be in the way the UUID is created and stored. The code comment states that
the generated UUID "will be used as the base for all load jobs issued from this instance of
the transform":
https://github.com/apache/beam/blob/v2.0.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L184

I can indeed confirm from the logs that the job id is the same across load jobs.
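To illustrate the direction suggested in the issue, here is a minimal, self-contained sketch of appending a fresh UUID to a transform-level base id so that each issued load job gets a distinct job id. The class and method names (`JobIdSketch`, `nextJobId`) are hypothetical illustrations, not Beam's actual API:

{code:java}
import java.util.UUID;

// Hypothetical sketch: instead of reusing the single UUID generated when
// the transform is constructed, append a fresh UUID per load job so that
// BigQuery never sees a duplicate job id (which triggers the 409 Conflict).
public class JobIdSketch {
    // Base id generated once per transform instance (as in BatchLoads).
    private final String baseJobId;

    public JobIdSketch(String baseJobId) {
        this.baseJobId = baseJobId;
    }

    // Each call yields a distinct id: base id plus a per-job UUID suffix.
    public String nextJobId() {
        return baseJobId + "_" + UUID.randomUUID().toString().replace("-", "");
    }

    public static void main(String[] args) {
        JobIdSketch sketch = new JobIdSketch("beam_load_myjob");
        String first = sketch.nextJobId();
        String second = sketch.nextJobId();
        // Two consecutive ids differ, so retried or parallel load jobs
        // would no longer collide on the job id.
        System.out.println(first.equals(second));
    }
}
{code}

This is only a sketch of the idea; the real fix would have to live where WriteTables constructs the job reference it passes to the BigQuery API.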

> Fix bigquery.WriteTables generating non-unique job identifiers
> --------------------------------------------------------------
>
>                 Key: BEAM-2768
>                 URL: https://issues.apache.org/jira/browse/BEAM-2768
>             Project: Beam
>          Issue Type: Bug
>          Components: beam-model
>    Affects Versions: 2.0.0
>            Reporter: Matti Remes
>            Assignee: Reuven Lax
>
> This is a result of BigQueryIO not creating unique job ids for batch inserts, causing the BigQuery API to respond with a 409 Conflict error:
> {code:java}
> Request failed with code 409, will NOT retry: https://www.googleapis.com/bigquery/v2/projects/<project_id>/jobs
> {code}
> The jobs are initiated in the BatchLoads/SinglePartitionWriteTables step, by its WriteTables ParDo:
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L511-L521
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L148
> It would probably be a good idea to append a UUID as part of each job id.
> Edit: This is a major bug blocking the use of BigQuery as a sink for bounded input.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
