beam-commits mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Work logged] (BEAM-2660) Set PubsubIO batch size using builder
Date Wed, 08 Aug 2018 21:24:00 GMT

     [ https://issues.apache.org/jira/browse/BEAM-2660?focusedWorklogId=132645&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-132645 ]

ASF GitHub Bot logged work on BEAM-2660:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Aug/18 21:23
            Start Date: 08/Aug/18 21:23
    Worklog Time Spent: 10m 
      Work Description: cjmcgraw commented on issue #3619: [BEAM-2660] Set PubsubIO batch
size using builder
URL: https://github.com/apache/beam/pull/3619#issuecomment-411557232
 
 
   Currently my company is using this change to load prediction tuples in fast batch jobs. We are running it on Dataflow as we speak, and have been since this fork was created. Our use case most likely won't need to be streaming, so the change is effective for my problem.
   
   That being said, I am not fully grokking the issue here. I'd like to get it clarified in case someone stumbles across this in the future.
   
   @dadrian 
   > What? For one, this PR doesn't touch the source, just the sink. Second, if that's the case, how do we get this fixed in the Dataflow runner? I currently have code running in prod that rolls its own Pubsub client to compensate for this size limitation, and I'd really like to get rid of it.
   
   @reuvenlax 
   > @dadrian true of both the source and the sink, at least for Dataflow streaming. Dataflow's
batch runner does use this code.
   
   @aromanenko-dev 
   > Yes, that is why I was wondering how it's related to any specific runner, and @reuvenlax explained that it turns out the Dataflow runner has its own implementation for Pubsub support.
   
   If I recall correctly, the limitation with the sink was that it used the gcloud SDK to submit a gRPC request, and there was a hard-coded default for the maximum number of bytes a single bulk request could contain. I simply made that hard-coded value configurable.
   
   Since the implementation was in the builder for the sink, I applied the values to both
the bounded and unbounded sinks. 
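   
   As a rough sketch of how a pipeline might set these limits through the builder (the method names withMaxBatchSize and withMaxBatchBytesSize are my assumption of how the knobs are exposed, not confirmed from this PR):
   
       import java.util.Arrays;
       
       import org.apache.beam.sdk.Pipeline;
       import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
       import org.apache.beam.sdk.options.PipelineOptionsFactory;
       import org.apache.beam.sdk.transforms.Create;
       
       public class PublishWithBatchLimits {
         public static void main(String[] args) {
           Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
       
           pipeline
               // A bounded input, so the write goes through the bounded Pubsub sink.
               .apply("MakeEvents", Create.of(Arrays.asList("event-1", "event-2", "event-3")))
               .apply(
                   "WriteToPubsub",
                   PubsubIO.writeStrings()
                       .to("projects/my-project/topics/my-topic")
                       // Assumed builder options: cap the number of messages per publish
                       // request and the total bytes per request so a batch stays well
                       // under Pub/Sub's 10 MB request limit.
                       .withMaxBatchSize(100)
                       .withMaxBatchBytesSize(5 * 1024 * 1024));
       
           pipeline.run().waitUntilFinish();
         }
       }
   
   With an unbounded input (for example a streaming source), the same builder settings would carry over to the unbounded sink.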
   
   The source request didn't have a maximum message size API parameter, so that limit will be enforced by Pubsub instead of Beam.
   
   If I am understanding all of this correctly, this means the change can be used in both the bounded and unbounded cases.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 132645)
    Time Spent: 3h 20m  (was: 3h 10m)

> Set PubsubIO batch size using builder
> -------------------------------------
>
>                 Key: BEAM-2660
>                 URL: https://issues.apache.org/jira/browse/BEAM-2660
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-gcp
>            Reporter: Carl McGraw
>            Assignee: Chamikara Jayalath
>            Priority: Major
>              Labels: gcp, java, pubsub, sdk
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> PubsubIO doesn't allow users to set the publish batch size. Instead, the value is hard-coded in both the BoundedPubsubWriter and the UnboundedPubsubSink.
> Google's Pub/Sub is bound to a maximum request size of 10 MB. My company has run into problems with events that are individually smaller than 1 MB but, when batched with the default batch sizes of 100 or 2000, cause Pub/Sub to fail to send the request.
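
A back-of-the-envelope illustration of that failure mode (the 0.5 MB figure is hypothetical, not taken from the issue): with events averaging 0.5 MB, even the smaller default batch of 100 messages produces a request of roughly 100 x 0.5 MB = 50 MB, five times the 10 MB per-request ceiling, so the publish fails even though every individual event is well under the limit.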



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
