Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Delivered-To: mailing list user@flink.apache.org
MIME-Version: 1.0
From: Till Rohrmann
Date: Tue, 22 Aug 2017 10:21:11 +0200
Subject: Re: akka timeout
To: Steven Wu
Cc: Chesnay Schepler, user
Content-Type: multipart/alternative; boundary="001a114750069d2d3505575349ed"

--001a114750069d2d3505575349ed
Content-Type: text/plain; charset="UTF-8"

Hi Steven,

quick correction for Flink 1.2. Indeed the MetricFetcher does not pick up the right timeout value from the configuration. Instead it uses a hardcoded 10s timeout.
This has only been changed recently and is already committed to master. So with the next release, 1.4, it will properly pick up the right timeout settings.

Just out of curiosity, what's the instability issue you're observing?

Cheers,
Till

On Fri, Aug 18, 2017 at 7:07 PM, Steven Wu wrote:

> Till/Chesnay, thanks for the answers. Looks like this is a result/symptom
> of an underlying stability issue that I am trying to track down.
>
> It is Flink 1.2.
>
> On Fri, Aug 18, 2017 at 12:24 AM, Chesnay Schepler wrote:
>
>> The MetricFetcher always uses the default akka timeout value.
>>
>> On 18.08.2017 09:07, Till Rohrmann wrote:
>>
>> Hi Steven,
>>
>> I thought that the MetricFetcher picks up the right timeout from the
>> configuration. Which version of Flink are you using?
>>
>> The timeout is not a critical problem for the job health.
>>
>> Cheers,
>> Till
>>
>> On Fri, Aug 18, 2017 at 7:22 AM, Steven Wu wrote:
>>
>>> We have set akka.ask.timeout to 60 s in the yaml file. I also confirmed the
>>> setting in the Flink UI. But I saw an akka timeout of 10 s for the metric query
>>> service. Two questions:
>>> 1) Why doesn't the metric query use the 60 s value configured in the yaml file?
>>> Does it always use the default 10 s value?
>>> 2) Could this cause a heartbeat failure between task manager and job
>>> manager? Or is this just a non-critical failure that won't affect job health?
>>>
>>> Thanks,
>>> Steven
>>>
>>> 2017-08-17 23:34:33,421 WARN org.apache.flink.runtime.webmonitor.metrics.MetricFetcher - Fetching metrics failed.
>>> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@1.2.3.4:39139/user/MetricQueryService_23cd9db754bb7d123d80e6b1c0be21d6]] after [10000 ms]
>>>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
>>>     at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
>>>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
>>>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
>>>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
>>>     at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
>>>     at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
>>>     at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
>>>     at java.lang.Thread.run(Thread.java:748)
>>>
>>
>

--001a114750069d2d3505575349ed
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--001a114750069d2d3505575349ed--