From: Raghu Angadi
Date: Wed, 24 May 2017 10:50:01 -0700
Subject: Re: How to decrease latency when using PubsubIO.Read?
To: user@beam.apache.org

Josh,

Can you share your job_id? I could take a look. Are you measuring latency
end-to-end (from the publisher to when the record appears in Bigtable)?
Are you using BigtableIO for the sink?

There is no easy way to use more workers when autoscaling is enabled. It
decides that your backlog and CPU utilisation are low enough that it does
not need to scale.

Raghu.

On Wed, May 24, 2017 at 10:14 AM, Josh <jofo90@gmail.com> wrote:

> Thanks Ankur, that's super helpful! I will give these optimisations a go.
>
> About the "No operations completed" message - there are a few of these in
> the logs (but very few, like one an hour), so there is probably no need
> to scale up Bigtable.
> I did however see a lot of INFO messages saying "Wrote 0 records" in the
> logs. Probably about 50% of the "Wrote n records" messages are zero, while
> the other 50% are quite high (e.g. "Wrote 80 records"). Not sure if that
> could indicate a bad setting?
>
> Josh
>
> On Wed, May 24, 2017 at 5:22 PM, Ankur Chauhan <ankur@malloc64.com> wrote:
>
>> There are two main things to look at here:
>>
>> * In the logs, are there any messages like "No operations completed
>> within the last 61 seconds. There are still 1 simple operations and 1
>> complex operations in progress."? This means you are underscaled on the
>> Bigtable side and would benefit from increasing the node count.
>> * We also saw some improvement in performance (workload dependent) by
>> going to a bigger worker machine type.
>> * Another optimization that worked for our use case is the options below:
>>
>> // Streaming Dataflow has larger machines with smaller bundles, so we can
>> // queue up a lot more without blowing up.
>> private static BigtableOptions createStreamingBTOptions(AnalyticsPipelineOptions opts) {
>>     return new BigtableOptions.Builder()
>>         .setProjectId(opts.getProject())
>>         .setInstanceId(opts.getBigtableInstanceId())
>>         .setUseCachedDataPool(true)
>>         .setDataChannelCount(32)
>>         .setBulkOptions(new BulkOptions.Builder()
>>             .setUseBulkApi(true)
>>             .setBulkMaxRowKeyCount(2048)
>>             .setBulkMaxRequestSize(8_388_608L)
>>             .setAsyncMutatorWorkerCount(32)
>>             .build())
>>         .build();
>> }
>>
>> There is a lot of trial and error involved in getting the end-to-end
>> latency down, so I would suggest enabling profiling using the
>> --saveProfilesToGcs option to get a sense of what exactly is happening.
>>
>> -- Ankur Chauhan
>>
>> On May 24, 2017, at 9:09 AM, Josh <jofo90@gmail.com> wrote:
>>
>> Ah ok - I am using the Dataflow runner. I didn't realise the custom
>> implementation was provided at runtime...
>>
>> Any ideas of how to tweak my job to either lower the latency consuming
>> from PubSub or to lower the latency in writing to Bigtable?
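
To make Ankur's snippet concrete: a minimal sketch of wiring those tuned
BigtableOptions into the sink, assuming the Beam 2.0.0 BigtableIO API
(withBigtableOptions / withTableId). It reuses Ankur's
createStreamingBTOptions() and AnalyticsPipelineOptions from above; the
"events" table id and the "mutations" collection are hypothetical names,
and BigtableIO.write() expects elements of type
KV<ByteString, Iterable<Mutation>>.

    import com.google.bigtable.v2.Mutation;
    import com.google.protobuf.ByteString;
    import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // Attach the tuned BigtableOptions to the sink. "events" is a hypothetical
    // table id; "mutations" is whatever upstream transform produces the
    // KV<ByteString, Iterable<Mutation>> elements that BigtableIO.write() expects.
    static void writeToBigtable(PCollection<KV<ByteString, Iterable<Mutation>>> mutations,
                                AnalyticsPipelineOptions opts) {
      mutations.apply("WriteToBigtable",
          BigtableIO.write()
              .withBigtableOptions(createStreamingBTOptions(opts))
              .withTableId("events"));
    }
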
>>
>> On Wed, May 24, 2017 at 4:14 PM, Lukasz Cwik <lcwik@google.com> wrote:
>>
>>> What runner are you using (Flink, Spark, Google Cloud Dataflow, Apex, ...)?
>>>
>>> On Wed, May 24, 2017 at 8:09 AM, Ankur Chauhan <ankur@malloc64.com> wrote:
>>>
>>>> Sorry, that was an autocorrect error. I meant to ask - what Dataflow
>>>> runner are you using? If you are using Google Cloud Dataflow then the
>>>> PubsubIO class is not the one doing the reading from the Pubsub topic.
>>>> They provide a custom implementation at run time.
>>>>
>>>> Ankur Chauhan
>>>> Sent from my iPhone
>>>>
>>>> On May 24, 2017, at 07:52, Josh <jofo90@gmail.com> wrote:
>>>>
>>>> Hi Ankur,
>>>>
>>>> What do you mean by runner address?
>>>> Would you be able to link me to the comment you're referring to?
>>>>
>>>> I am using the PubsubIO.Read class from Beam 2.0.0 as found here:
>>>> https://github.com/apache/beam/blob/release-2.0.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java
>>>>
>>>> Thanks,
>>>> Josh
>>>>
>>>> On Wed, May 24, 2017 at 3:36 PM, Ankur Chauhan <ankur@malloc64.com> wrote:
>>>>
>>>>> What runner address you using? Google Cloud Dataflow uses a closed-source
>>>>> version of the Pubsub reader, as noted in a comment on the Read class.
>>>>>
>>>>> Ankur Chauhan
>>>>> Sent from my iPhone
>>>>>
>>>>> On May 24, 2017, at 04:05, Josh <jofo90@gmail.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm using PubsubIO.Read to consume a Pubsub stream, and my job then
>>>>> writes the data out to Bigtable. I'm currently seeing a latency of 3-5
>>>>> seconds between a message being published and being written to Bigtable.
>>>>>
>>>>> I want to try and decrease the latency to <1s if possible - does
>>>>> anyone have any tips for doing this?
>>>>>
>>>>> I noticed that there is a PubsubGrpcClient
>>>>> https://github.com/apache/beam/blob/release-2.0.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubGrpcClient.java
>>>>> however the PubsubUnboundedSource is initialised with a PubsubJsonClient,
>>>>> so the gRPC client doesn't appear to be used. Is there a way to switch to
>>>>> the gRPC client, as perhaps that would give better performance?
>>>>>
>>>>> Also, I am running my job on Dataflow using autoscaling, which has only
>>>>> allocated one n1-standard-4 instance to the job, which is running at
>>>>> ~50% CPU. Could forcing a higher number of nodes help improve latency?
>>>>>
>>>>> Thanks for any advice,
>>>>> Josh
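
On the autoscaling question at the end of the thread: if a fixed, larger
worker pool is worth trying, below is a minimal sketch of the relevant
Dataflow runner settings, assuming the standard DataflowPipelineOptions /
DataflowPipelineWorkerPoolOptions interfaces. The worker count and machine
type are illustrative only, and the --saveProfilesToGcs flag Ankur mentions
can be passed on the command line alongside these.

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public static void main(String[] args) {
      // Pin a fixed worker pool instead of relying on autoscaling, and pick a
      // bigger machine type. The values here are illustrative, not a recommendation.
      DataflowPipelineOptions options =
          PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
      options.setStreaming(true);
      options.setWorkerMachineType("n1-standard-8");
      options.setNumWorkers(4);
      options.setMaxNumWorkers(4);
      options.setAutoscalingAlgorithm(
          DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType.NONE);
      // Build and run the Pubsub-to-Bigtable pipeline with these options as
      // usual, e.g. Pipeline p = Pipeline.create(options); ... p.run();
    }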