beam-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From jof...@gmail.com
Subject Re: PubSubIO withTimestampAttribute - what are the implications?
Date Thu, 03 Aug 2017 20:34:35 GMT
Thanks Lukasz that's good to know! It sounds like we can halve our PubSub costs then!

Just to clarify, if I were to remove withTimestampAttribute from a job, this would cause the
watermark to always be up to date (processing time) even if the job starts to lag behind (build
up of unacknowledged PubSub messages). In this case would Dataflow's autoscaling still scale
up? I thought the reason the autoscaler scales up is due to the watermark lagging behind,
but is it also aware of the acknowledged PubSub messages?

> On 3 Aug 2017, at 18:58, Lukasz Cwik <lcwik@google.com> wrote:
> 
> You understanding is correct - the data watermark will only matter for windowing. It
will not affect auto-scaling. If the pipeline is not doing any windowing, triggering, etc
then there is no need to pay for the cost of the second subscription. 
> 
>> On Thu, Aug 3, 2017 at 8:17 AM, Josh <jofo90@gmail.com> wrote:
>> Hi all,
>> 
>> We've been running a few streaming Beam jobs on Dataflow, where each job is consuming
from PubSub via PubSubIO. Each job does something like this:
>> 
>> PubsubIO.readMessagesWithAttributes()
>>             .withIdAttribute("unique_id")
>>             .withTimestampAttribute("timestamp");
>> 
>> My understanding of `withTimestampAttribute` is that it means we use the timestamp
on the PubSub message as Beam's concept of time (the watermark) - so that any windowing we
do in the job uses "event time" rather than "processing time".
>> 
>> My question is: is my understanding correct, and does using `withTimestampAttribute`
have any effect in a job that doesn't do any windowing? I have a feeling it may also have
an effect on Dataflow's autoscaling, since I think Dataflow scales up when the watermark timestamp
lags behind, but I'm not sure about this.
>> 
>> The reason I'm concerned about this is because we've been using it in all our Dataflow
jobs, and have now realised that whenever `withTimestampAttribute` is used, Dataflow creates
an additional PubSub subscription (suffixed with `__streaming_dataflow_internal`), which appears
to be doubling PubSub costs (since we pay per subscription)! So I want to remove `withTimestampAttribute`
from jobs where possible, but want to first understand the implications.
>> 
>> Thanks for any advice,
>> Josh
> 

Mime
View raw message