beam-commits mailing list archives

From "Alexander Hoem Rosbach (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-3039) DatastoreIO.Write fails multiple mutations of same entity
Date Tue, 17 Oct 2017 18:28:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208098#comment-16208098
] 

Alexander Hoem Rosbach commented on BEAM-3039:
----------------------------------------------

Would it offer any advantage to use GroupByKey instead of Distinct?

What do you think about including features in DatastoreIO to avoid the issue? They could
be optional parameters passed to the write function if you don't agree that it is a bug. In
my opinion it is a bug, because it breaks what I would assume is a common use case for Dataflow
implementations: streaming data from Pub/Sub into Datastore.

For instance:
{code}
.apply(DatastoreIO.v1().write().withProjectId(options.getProject()).removeDuplicatesWithinCommits());
{code}
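
As a rough illustration of what such an option might do inside the writer before each commit, here is a minimal sketch in plain Java. The names BatchDeduper and lastPerKey are hypothetical and not part of DatastoreIO, and the strings in the usage note stand in for Datastore mutations. It assumes that when several mutations target the same key, keeping only the last one is acceptable.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper (not part of DatastoreIO): keeps one mutation per entity key in a batch. */
public final class BatchDeduper {
    private BatchDeduper() {}

    /**
     * Returns a copy of the batch with at most one element per key.
     * When several elements share a key, the last one wins; the result is
     * ordered by the first occurrence of each key.
     */
    public static <K, V> List<V> lastPerKey(List<V> batch, Function<V, K> keyOf) {
        // LinkedHashMap keeps the position of the first insertion of a key,
        // while put() replaces the stored value with the latest one.
        Map<K, V> byKey = new LinkedHashMap<>();
        for (V mutation : batch) {
            byKey.put(keyOf.apply(mutation), mutation);
        }
        return new ArrayList<>(byKey.values());
    }
}
```

With a batch of "a=1", "b=1", "a=2" and a key function that takes the text before "=", the call returns "a=2", "b=1", so the commit would no longer contain two mutations for key "a".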

> DatastoreIO.Write fails multiple mutations of same entity
> ---------------------------------------------------------
>
>                 Key: BEAM-3039
>                 URL: https://issues.apache.org/jira/browse/BEAM-3039
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-extensions
>    Affects Versions: 2.1.0
>            Reporter: Alexander Hoem Rosbach
>            Assignee: Chamikara Jayalath
>            Priority: Minor
>
> When streaming messages from a source that doesn't guarantee exactly-once delivery, but
> only at-least-once delivery, DatastoreIO.Write will throw an exception, which leads
> to Dataflow retrying the same commit multiple times before giving up. This creates a significant
> bottleneck in the pipeline, with the end result that the data is dropped. This should be handled
> better.
> There are a number of ways to fix this. One of them could be to drop any duplicate mutations
> within one batch. Non-duplicates should also be handled in some way: perhaps use a NON-TRANSACTIONAL
> commit, or make sure the mutations are committed in different commits.
> {code}
> com.google.datastore.v1.client.DatastoreException: A non-transactional commit may not contain multiple mutations affecting the same entity., code=INVALID_ARGUMENT
>         com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:126)
>         com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:169)
>         com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:89)
>         com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
>         org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.flushBatch(DatastoreV1.java:1288)
>         org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.processElement(DatastoreV1.java:1253)

> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
