From: Steven Wu
Date: Wed, 23 Aug 2017 10:01:43 -0700
Subject: Re: akka timeout
To: Till Rohrmann
Cc: Chesnay Schepler, user
Delivered-To: mailing list user@flink.apache.org

Till,

Once our job is restarted for some reason (e.g. a taskmanager container
got killed), it can get stuck in a continuous restart loop for hours.
Right now, I suspect it is caused by GC pauses during restart; our job
has very high memory allocation in steady state. A long GC pause then
causes an akka timeout, which in turn causes the jobmanager to think the
taskmanager containers are unhealthy/dead and kill them. And the cycle
repeats...

But I haven't been able to prove or disprove it yet. When I was asking
the question, I was still sifting through metrics and error logs.

Thanks,
Steven

On Tue, Aug 22, 2017 at 1:21 AM, Till Rohrmann <till.rohrmann@gmail.com> wrote:

> Hi Steven,
>
> quick correction for Flink 1.2. Indeed the MetricFetcher does not pick up
> the right timeout value from the configuration. Instead it uses a hardcoded
> 10s timeout. This has only been changed recently and is already committed
> in the master. So with the next release 1.4 it will properly pick up the
> right timeout settings.
>
> Just out of curiosity, what's the instability issue you're observing?
>
> Cheers,
> Till
>
> On Fri, Aug 18, 2017 at 7:07 PM, Steven Wu <stevenz3wu@gmail.com> wrote:
>
>> Till/Chesnay, thanks for the answers. Looks like this is a result/symptom
>> of an underlying stability issue that I am trying to track down.
>>
>> It is Flink 1.2.
>>
>> On Fri, Aug 18, 2017 at 12:24 AM, Chesnay Schepler <chesnay@apache.org>
>> wrote:
>>
>>> The MetricFetcher always uses the default akka timeout value.
>>>
>>>
>>> On 18.08.2017 09:07, Till Rohrmann wrote:
>>>
>>> Hi Steven,
>>>
>>> I thought that the MetricFetcher picks up the right timeout from the
>>> configuration. Which version of Flink are you using?
>>>
>>> The timeout is not a critical problem for the job health.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Fri, Aug 18, 2017 at 7:22 AM, Steven Wu <stevenz3wu@gmail.com> wrote:
>>>
>>>>
>>>> We have set akka.ask.timeout to 60 s in the yaml file. I also confirmed
>>>> the setting in the Flink UI. But I saw an akka timeout of 10 s for the
>>>> metric query service. Two questions:
>>>> 1) Why doesn't the metric query use the 60 s value configured in the
>>>> yaml file? Does it always use the default 10 s value?
>>>> 2) Could this cause heartbeat failure between task manager and job
>>>> manager? Or is this just a non-critical failure that won't affect job
>>>> health?
>>>>
>>>> Thanks,
>>>> Steven
>>>>
>>>> 2017-08-17 23:34:33,421 WARN org.apache.flink.runtime.webmonitor.metrics.MetricFetcher - Fetching metrics failed.
>>>> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@1.2.3.4:39139/user/MetricQueryService_23cd9db754bb7d123d80e6b1c0be21d6]] after [10000 ms]
>>>>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
>>>>     at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
>>>>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
>>>>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>>>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
>>>>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
>>>>     at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
>>>>     at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
>>>>     at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>
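A rough sketch of how the GC suspicion in this message could be checked, offered only as an illustration. It uses the standard `java.lang.management` API to sample accumulated GC time in the JVM it runs in, so it would have to be started inside the process under investigation (for example from a background thread in user code), which is an assumption about the setup. The class `GcPauseProbe`, its two helper methods, the 1 s sampling interval, and the `TIMEOUT_MILLIS` constant are hypothetical names and values, not part of Flink; the threshold simply mirrors the 10000 ms timeout from the quoted log. `getCollectionTime()` reports approximate accumulated collection time, which for concurrent collectors is not the same as stop-the-world pause time, so GC logs are likely the more reliable evidence.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper, not part of Flink: samples accumulated GC time in the current JVM.
class GcPauseProbe {
    // Assumed threshold: the 10 s (10000 ms) ask timeout seen in the quoted log.
    static final long TIMEOUT_MILLIS = 10_000L;

    // Sums collection time over all collectors; a bean returns -1 when the value is undefined.
    static long totalGcMillis(List<GarbageCollectorMXBean> beans) {
        long total = 0L;
        for (GarbageCollectorMXBean bean : beans) {
            long millis = bean.getCollectionTime();
            if (millis > 0L) {
                total += millis;
            }
        }
        return total;
    }

    // True when the GC time accumulated between two samples reaches the timeout.
    static boolean exceedsTimeout(long beforeMillis, long afterMillis) {
        return afterMillis - beforeMillis >= TIMEOUT_MILLIS;
    }

    public static void main(String[] args) throws InterruptedException {
        List<GarbageCollectorMXBean> beans = ManagementFactory.getGarbageCollectorMXBeans();
        long before = totalGcMillis(beans);
        Thread.sleep(1_000L); // sampling interval; illustrative value
        long after = totalGcMillis(beans);
        System.out.println("GC time during interval: " + (after - before) + " ms");
        if (exceedsTimeout(before, after)) {
            System.out.println("GC time alone reached the ask timeout in this interval");
        }
    }
}
```

In practice the sampling would run in a loop around a restart, and the per-interval GC time would be compared against the timestamps of the akka timeouts in the logs.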