beam-commits mailing list archives

From "Joshua Fox (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (BEAM-991) DatastoreIO Write should flush early for large batches
Date Sat, 19 Nov 2016 16:08:59 GMT

    [ https://issues.apache.org/jira/browse/BEAM-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679482#comment-15679482
] 

Joshua Fox edited comment on BEAM-991 at 11/19/16 4:08 PM:
-----------------------------------------------------------

The maximum request size is 10 MB; the maximum Item size is 1 MB. The implementation _must_
support all legal Items. Possible solutions:

- Set the maximum batch size to 10. Since each Item is at most 1 MB, every request then stays
under the 10 MB limit. That obviously reduces performance, but allows requests to complete.
- Have users set a constant batch size between 10 and the Datastore API maximum of 500. This
is problematic, since we do not always know how big our Items are, particularly when developing
generic solutions.
- Start with batch size 500. If that fails with a "too-large" error, the implementation cuts
the batch size in half and retries, repeating until the _put_ succeeds. This new value is then
used for a while. On the assumption that Entities of similar size tend to be grouped together,
occasionally ramp the batch size back up to test whether the current Entities are small enough
for larger batches, again reverting to a smaller batch size on failure. Perhaps save the batch
size, and ramp it up and down, on a per-Kind basis.
- Measure _getSerializedSize()_ of _all_ Items on _every put_, and adjust the batch size accordingly.
This may be slow. (A sketch combining this with the halve-on-failure idea follows this list.)
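
To make the last two options concrete, here is a minimal sketch. It assumes the Cloud Datastore
v1 protobuf classes (_Entity_, _Mutation_, _CommitRequest_) and the _com.google.datastore.v1.client.Datastore_
client; the _SizeAwareBatcher_ name and its thresholds are hypothetical, not existing DatastoreIO code.

{code:java}
import com.google.datastore.v1.CommitRequest;
import com.google.datastore.v1.Entity;
import com.google.datastore.v1.Mutation;
import com.google.datastore.v1.client.Datastore;
import com.google.datastore.v1.client.DatastoreException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not existing DatastoreIO code: buffers upserts, flushes
// before the request grows past the limits, and halves the batch on failure.
class SizeAwareBatcher {
  private static final int MAX_ENTITIES_PER_REQUEST = 500;        // Datastore API limit on mutations per commit
  private static final long MAX_REQUEST_BYTES = 9L * 1024 * 1024; // stay safely under the ~10 MB request cap

  private final Datastore datastore;
  private final List<Mutation> buffer = new ArrayList<>();
  private long bufferedBytes = 0;

  SizeAwareBatcher(Datastore datastore) {
    this.datastore = datastore;
  }

  // Add one entity; flush first if adding it would exceed either the count or the byte limit.
  void add(Entity entity) throws DatastoreException {
    long entityBytes = entity.getSerializedSize(); // protobuf-reported size, a cheap proxy for request growth
    if (!buffer.isEmpty()
        && (buffer.size() >= MAX_ENTITIES_PER_REQUEST
            || bufferedBytes + entityBytes > MAX_REQUEST_BYTES)) {
      flush();
    }
    buffer.add(Mutation.newBuilder().setUpsert(entity).build());
    bufferedBytes += entityBytes;
  }

  // Commit everything buffered so far; on failure, recursively halve and retry (option 3 above).
  void flush() throws DatastoreException {
    commitWithSplitting(new ArrayList<>(buffer));
    buffer.clear();
    bufferedBytes = 0;
  }

  private void commitWithSplitting(List<Mutation> mutations) throws DatastoreException {
    if (mutations.isEmpty()) {
      return;
    }
    try {
      datastore.commit(CommitRequest.newBuilder()
          .setMode(CommitRequest.Mode.NON_TRANSACTIONAL)
          .addAllMutations(mutations)
          .build());
    } catch (DatastoreException e) {
      // Real code should first check that the error is actually a too-large request.
      if (mutations.size() == 1) {
        throw e; // a single legal (<= 1 MB) Item should always fit; rethrow
      }
      int mid = mutations.size() / 2;
      commitWithSplitting(mutations.subList(0, mid));
      commitWithSplitting(mutations.subList(mid, mutations.size()));
    }
  }
}
{code}

The byte threshold is deliberately below 10 MB because _getSerializedSize()_ measures the Entity
alone, not the surrounding Mutation and CommitRequest overhead.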




> DatastoreIO Write should flush early for large batches
> ------------------------------------------------------
>
>                 Key: BEAM-991
>                 URL: https://issues.apache.org/jira/browse/BEAM-991
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-gcp
>            Reporter: Vikas Kedigehalli
>            Assignee: Vikas Kedigehalli
>
> If entities are large (avg size > 20KB), then a single batched write (500 entities)
would exceed the Datastore size limit for a single request (10MB), per https://cloud.google.com/datastore/docs/concepts/limits.
> First reported in: http://stackoverflow.com/questions/40156400/why-does-dataflow-erratically-fail-in-datastore-access



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
