From: Michael Giroux
To: user@storm.apache.org
Date: Wed, 18 Nov 2020 14:20:59 +0000 (UTC)
Subject: Re: Topology is stuck after upgrade to Storm 2.2.0 - how can I analyze what's going on?

Thanks for the info. We/I haven't tapped into the metrics (yet?). Glad you got your problem resolved.

On Wednesday, November 18, 2020, 09:14:21 AM EST, Adam Honen wrote:

Hi,

I've managed to resolve this, so it's probably best to share what the issue was in my case. As mentioned above, we have our own back pressure mechanism. It's all controlled from the spout, so I figured out (read: guessed) we were probably hitting Storm's limit for the spout's queue.

After increasing topology.executor.receive.buffer.size further, so it became larger than our own limit (50K in this case), the issue is resolved.
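Adam's fix above - raising `topology.executor.receive.buffer.size` past the topology's own 50K in-flight limit - can be sketched as a plain topology conf map. This is a hedged sketch: the config key is Storm 2's real setting, but the rounding to 65536 is an illustrative choice based on the convention of keeping this queue size at a power of two.

```java
import java.util.HashMap;
import java.util.Map;

public class ReceiveBufferConf {
    // Storm 2 config key for the per-executor receive queue size.
    static final String RECEIVE_BUFFER_KEY = "topology.executor.receive.buffer.size";

    // Build a topology conf that raises the receive queue above the
    // application's own in-flight limit (50K in Adam's case), rounded
    // up to a power of two.
    public static Map<String, Object> buildConf(int appInFlightLimit) {
        int size = Integer.highestOneBit(appInFlightLimit);
        if (size < appInFlightLimit) {
            size <<= 1; // round up to the next power of two
        }
        Map<String, Object> conf = new HashMap<>();
        conf.put(RECEIVE_BUFFER_KEY, size);
        return conf;
    }

    public static void main(String[] args) {
        // 50,000 rounds up to 65536.
        System.out.println(buildConf(50_000).get(RECEIVE_BUFFER_KEY)); // 65536
    }
}
```

In a real topology this map would be merged into the `Config` passed to `StormSubmitter.submitTopology`. The Storm 2.x default for this setting is 32768, which is below a 50K application-level in-flight limit - consistent with the symptom Adam describes.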
Now, as for identifying this more easily next time: I see in the code that this configuration is read in WorkerState.mkReceiveQueueMap and passed to the constructor of JCQueue, where a metrics object is created. It looks like there are some really useful metrics reported there.

So next time I plan on hooking up to these metrics (either via one of the built-in reporters, or via a new implementation better geared to our needs) and reporting some of them to our monitoring system. That should make troubleshooting such issues way simpler.

I haven't tested this part yet and it's not documented here: https://storm.apache.org/releases/2.2.0/ClusterMetrics.html , but hopefully it should still work.

On Tue, Nov 17, 2020 at 4:08 PM Adam Honen wrote:

Hi,

I'm wondering what sort of metrics, logs, or other indications I can use in order to understand why my topology gets stuck after upgrading from Storm 1.1.1 to Storm 2.2.0.

At more length:

I have a 1.1.1 cluster with 40 workers processing ~400K events/second. It starts by reading from Kinesis via the AWS KCL, and this is also used to implement our own backpressure. That is, when the topology is overloaded with tuples, we stop reading from Kinesis until enough progress has been made (we've been able to checkpoint). After that, we resume reading.

However, with so many workers we don't really see back pressure being needed, even when dealing with much larger event rates.

We've now created a similar cluster with Storm 2.2.0 and I've tried deploying our topology there. However, what happens is that within a couple of seconds, no more Kinesis records get read. The topology appears to just wait forever without processing anything.

I would like to troubleshoot this, but I'm not sure where to collect data from. My initial suspicion was that the new back pressure mechanism, now found in Storm 2, might have kicked in and that I need to configure it in order to resolve this issue.
However, this is nothing more than a guess. I'm not sure how I can actually prove or disprove this without lots of trial & error.

I've found some documentation about backpressure in the performance tuning chapter of the documentation, but it only covers configuration parameters and doesn't explain how to really understand what's going on in a running topology.
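Adam's follow-up plan - wiring the JCQueue metrics into one of Storm's built-in reporters - could be sketched in storm.yaml roughly as below. This is a hedged sketch based on Storm 2's metrics2 subsystem (documented separately from the ClusterMetrics page linked above); the Graphite reporter class follows that subsystem's naming, and the host, port, and reporting period are placeholder values:

```yaml
# storm.yaml (sketch): ship worker metrics, including the JCQueue
# receive-queue gauges, to a Graphite-compatible endpoint.
storm.metrics.reporters:
  - class: "org.apache.storm.metrics2.reporters.GraphiteStormReporter"
    daemons:
      - "worker"                            # queue metrics live in the workers
    report.period: 60
    report.period.units: "SECONDS"
    graphite.host: "graphite.example.com"   # placeholder
    graphite.port: 2003
```

Watching the receive-queue population versus capacity gauges should show directly whether a spout or bolt queue is pinned at its limit - which is exactly the condition that wedged this topology.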