From: Stig Rohde Døssing
Date: Wed, 28 Aug 2019 16:30:30 +0200
Subject: Re: Storm 2.0.0 Local Cluster worker restart
To: user@storm.apache.org

Yes, you are right, that is better.

Regarding tests, there's an AsyncLocalizerTest that stubs the blob store and makes some assertions about download behavior. Maybe that would be a good fit?

On Wed, Aug 28, 2019 at 14:30, Diogo Monteiro <diogo.monteiro12@gmail.com> wrote:
Hi Stig,
Thanks for the reply.

Makes sense. But wouldn't it be better if we just explicitly created an empty resources directory (Files.createDirectories(extractionDest) is enough) in an else clause, instead of calling extractDirFromJar?

I say this because if we don't have a resources jar or a classpath URL, we aren't really extracting anything from a jar.
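
A rough sketch of what such an else clause could look like, using only java.nio. This is not the actual LocallyCachedTopologyBlob code: the class ResourcesDirSketch, the method prepareResources, and its resourcesJar and classpathDir parameters are assumptions made for illustration, and the two existing branches are left as placeholders. Only Files.createDirectories(extractionDest) is the call suggested above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: not the real LocallyCachedTopologyBlob code. */
final class ResourcesDirSketch {

    /**
     * Prepares the resources directory at extractionDest.
     * resourcesJar and classpathDir are hypothetical inputs; either may be null.
     */
    static void prepareResources(Path resourcesJar, Path classpathDir, Path extractionDest) {
        try {
            if (resourcesJar != null) {
                // Placeholder for the existing branch that extracts from a jar.
                throw new UnsupportedOperationException("jar extraction not sketched");
            } else if (classpathDir != null) {
                // Placeholder for the existing branch that copies from a classpath URL.
                throw new UnsupportedOperationException("classpath copy not sketched");
            } else {
                // Proposed new branch: nothing to extract, so create an empty directory.
                Files.createDirectories(extractionDest);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Files.createDirectories does not fail when the directory already exists, so running this branch again on a later update cycle should be harmless.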

Another thing: what's your opinion on how I should test this? LocallyCachedTopologyBlob does not have unit tests.

Diogo.

On Tue, Aug 27, 2019 at 5:32 PM Stig Rohde Døssing <stigdoessing@gmail.com> wrote:
Hi Diogo,

Thanks for your thorough explanation. I think you are right, and this is a bug. We'd be happy to see a PR to fix this.

I think a decent way to handle this could be adding an extra else clause to https://github.com/apache/storm/blob/2ba95bbd1c911d4fc6363b1c4b9c4c6d86ac9aae/storm-server/src/main/java/org/apache/storm/localizer/LocallyCachedTopologyBlob.java#L146, and simply creating an empty resources directory in the blob extraction directory by calling extractDirFromJar(resourcesJar, ServerConfigUtils.RESOURCES_SUBDIR, extractionDest);. This is just me spitballing, so please feel free to fix it some other way if you have a better idea.

On Tue, Aug 27, 2019 at 14:50, Diogo Monteiro <diogo.monteiro12@gmail.com> wrote:
Hi all,

My name is Diogo and I am a dev for Paddy Power Betfair in Porto, Portugal. We have been running Storm 1.x.x in production for a couple of years, and the time has come for us to upgrade to 2.0.0. We use LocalCluster to run topologies on our local machines for manual testing.

So, getting to the point: I was trying to launch a topology that I'm developing (in 2.0.0) and noticed that the worker was getting restarted every ~30 seconds.
I placed a breakpoint in the kill method of LocalContainer (https://github.com/apache/storm/blob/2ba95bbd1c911d4fc6363b1c4b9c4c6d86ac9aae/storm-server/src/main/java/org/apache/storm/daemon/supervisor/LocalContainer.java#L66) to try to understand why the worker was getting restarted.

The call stack was:
kill:66, LocalContainer (org.apache.storm.daemon.supervisor)
killContainerFor:269, Slot (org.apache.storm.daemon.supervisor)
handleRunning:724, Slot (org.apache.storm.daemon.supervisor)
stateMachineStep:218, Slot (org.apache.storm.daemon.supervisor)
run:931, Slot (org.apache.storm.daemon.supervisor)

From this I understand that the worker is killed because a blob has changed (https://github.com/apache/storm/blob/2ba95bbd1c911d4fc6363b1c4b9c4c6d86ac9aae/storm-server/src/main/java/org/apache/storm/daemon/supervisor/Slot.java#L724). In fact, there's a changing blob in the dynamicState at that point.

I checked the AsyncLocalizer, which downloads blobs, caches them locally, and notifies the Slot state machine of a changing blob.

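I noticed this:
  • https://github.com/apache/storm/blob/2ba95bbd1c911d4fc6363b1c4b9c4c6d86ac9aae/storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java#L339
  • https://github.com/apache/storm/blob/2ba95bbd1c911d4fc6363b1c4b9c4c6d86ac9aae/storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java#L265
  • https://github.com/apache/storm/blob/2ba95bbd1c911d4fc6363b1c4b9c4c6d86ac9aae/storm-server/src/main/java/org/apache/storm/localizer/LocallyCachedTopologyBlob.java#L142
  • https://github.com/apache/storm/blob/2ba95bbd1c911d4fc6363b1c4b9c4c6d86ac9aae/storm-server/src/main/java/org/apache/storm/localizer/LocallyCachedTopologyBlob.java#L192

Which tells me that (correct me if I'm wrong):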
  • The Supervisor tries to update blobs every 30 seconds.
  • The topology jar blob requires extraction of the resources directory (either from a jar or directly from a classpath URL). This is done in fetchUnzipToTemp, and its existence is checked in isFullyDownloaded.
  • The Slot is notified of a changing blob if:
    • the remote version is different from the local version (the code has changed),
    • OR the blob is not fully downloaded (fully downloaded means that the jar exists and the extracted resources directory exists).
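
A minimal sketch of the conditions listed above, written as two small predicates. It reflects only the reading given in this message, not Storm's actual implementation: the class BlobChangeSketch, the method isChanging, and all parameter names are assumptions for illustration. Only the name isFullyDownloaded is borrowed from the code linked above, and its body here is a simplification.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: names are hypothetical, not Storm's real API. */
final class BlobChangeSketch {

    /** Fully downloaded: the jar exists AND the extracted resources directory exists. */
    static boolean isFullyDownloaded(Path topologyJar, Path resourcesDir) {
        return Files.exists(topologyJar) && Files.isDirectory(resourcesDir);
    }

    /** The Slot is told about a changing blob if either listed condition holds. */
    static boolean isChanging(long localVersion, long remoteVersion, boolean fullyDownloaded) {
        return localVersion != remoteVersion || !fullyDownloaded;
    }
}
```

Under these assumptions, isChanging(v, v, false) is true for any version v, so a blob that never counts as fully downloaded is reported as changing even when the code is unchanged.
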
Well, I did not have a resources folder under the root of the classpath, and that's why the worker was being restarted every ~30 seconds: the Slot was being notified of a changing blob every time updateBlobs ran.
I created a resources folder (with dummy files) under the root of the classpath and the problem is now solved.
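
The effect over successive update cycles can be illustrated with a toy model. Everything in it is a simplifying assumption rather than Storm code: the class RestartLoopSketch, the method countChangeNotifications, and the idea that each re-download ends in the same state as the first one.

```java
/** Toy model only: names and behaviour are simplified assumptions, not Storm code. */
final class RestartLoopSketch {

    /**
     * Counts how many update cycles would report a changing blob,
     * assuming the remote version never changes after the initial download.
     */
    static int countChangeNotifications(int cycles, boolean resourcesFolderOnClasspath) {
        // State after the initial download: the jar is always there, the
        // extracted resources directory only if there was something to extract.
        boolean jarExists = true;
        boolean resourcesDirExists = resourcesFolderOnClasspath;
        int notifications = 0;
        for (int i = 0; i < cycles; i++) {
            boolean fullyDownloaded = jarExists && resourcesDirExists;
            if (!fullyDownloaded) {
                notifications++; // the Slot would restart the worker here
                // Re-downloading ends in the same state as the first download.
                resourcesDirExists = resourcesFolderOnClasspath;
            }
        }
        return notifications;
    }
}
```

In this model countChangeNotifications(10, false) returns 10 (a restart on every cycle), while countChangeNotifications(10, true) returns 0, which matches the behaviour before and after adding the dummy resources folder.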

However, if I understand correctly, the resources folder is only required for multilang. Our topologies do not use multilang, and this does not happen in Storm 1.1.3, for instance.

Am I seeing or doing something wrong, or is this expected behaviour?
I am happy to contribute if this is in fact worth opening an issue for and fixing.

I hope this is the right place for these questions, and thanks in advance for taking the time to look at this.

Regards,
Diogo