beam-commits mailing list archives

From "Eugene Kirpichov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-3067) BigQueryIO.Write fails on empty PCollection with DirectRunner (batch job)
Date Fri, 03 Nov 2017 20:47:01 GMT

    [ https://issues.apache.org/jira/browse/BEAM-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16238341#comment-16238341 ]

Eugene Kirpichov commented on BEAM-3067:
----------------------------------------

This was fixed as a side effect of https://github.com/apache/beam/pull/3863; the bug is
not present in Beam 2.2.
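
For context, the {{invalidEventRows}} collection in the quoted report would typically come from a multi-output ParDo. A rough sketch of that wiring in the Beam 2.x Java SDK follows; the tag names and the {{parseToTableRow}}/{{errorRow}} helpers are illustrative assumptions, not the reporter's actual code:

{code:java}
final TupleTag<TableRow> validTag = new TupleTag<TableRow>() {};
final TupleTag<TableRow> invalidTag = new TupleTag<TableRow>() {};

PCollectionTuple results = rawEvents.apply("ParseEvents",
    ParDo.of(new DoFn<String, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        try {
          // Main output: successfully parsed rows.
          c.output(parseToTableRow(c.element()));
        } catch (Exception e) {
          // Side output: rows that failed to parse.
          c.output(invalidTag, errorRow(c.element(), e));
        }
      }
    }).withOutputTags(validTag, TupleTagList.of(invalidTag)));

PCollection<TableRow> invalidEventRows = results.get(invalidTag);
{code}

When all input parses cleanly, {{invalidEventRows}} is empty, which is the situation that triggered the missing-schema error in batch mode on the DirectRunner.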

> BigQueryIO.Write fails on empty PCollection with DirectRunner (batch job)
> -------------------------------------------------------------------------
>
>                 Key: BEAM-3067
>                 URL: https://issues.apache.org/jira/browse/BEAM-3067
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-direct, sdk-java-gcp
>    Affects Versions: 2.1.0
>         Environment: Arch Linux, Java 1.8.0_144
>            Reporter: Dmitry Bigunyak
>            Assignee: Thomas Groh
>            Priority: Major
>
> I'm using the side-output feature to filter malformed events (errors) out of a stream
> of valid events. Then I save valid events into one BigQuery table, and errors go into
> another dedicated table.
> Here is the code for outputting error rows:
> {code:java}
> invalidEventRows.apply("WriteErrors", BigQueryIO.writeTableRows()
>         .to(errorTableRef)
>         .withSchema(ProcessEvents.getErrorSchema())
>         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
>         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
> {code}
> The problem is that when running on the DirectRunner in batch mode (reading input from
> a file) and the {{invalidEventRows}} PCollection ends up empty (all events are valid, so
> no errors), I get the following error:
> {code}
> [ERROR]   "status" : {
> [ERROR]     "errorResult" : {
> [ERROR]       "message" : "No schema specified on job or table.",
> [ERROR]       "reason" : "invalid"
> [ERROR]     },
> [ERROR]     "errors" : [ {
> [ERROR]       "message" : "No schema specified on job or table.",
> [ERROR]       "reason" : "invalid"
> [ERROR]     } ],
> [ERROR]     "state" : "DONE"
> [ERROR]   },
> {code}
> There are no errors when the same code runs and the {{invalidEventRows}} PCollection
> is not empty: the BigQuery table is created and the data are inserted correctly.
> Everything also seems to work fine in streaming mode (reading from Pub/Sub) on
> both the DirectRunner and the DataflowRunner.
> Looks like a bug?
> Or should I open an issue in the GoogleCloudPlatform/DataflowJavaSDK GitHub project?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
