From user-return-31751-archive-asf-public=cust-asf.ponee.io@flink.apache.org Sun Jan 5 02:28:37 2020
From: John Smith
Date: Sat, 4 Jan 2020 21:27:52 -0500
Subject: Re: Flink task node shut itself off.
To: Chesnay Schepler
Cc: Zhijiang, user

It seems to have happened again... Here is a screenshot of the system metrics for that day on that particular node:
https://www.dropbox.com/s/iudn7z2fvvy7vb8/flink-node.png?dl=0

On Fri, 3 Jan 2020 at 12:19, John Smith wrote:

> Well, there was this huge IO wait spike of over 140%. IO wait rose slowly
> for a couple of hours, then at some point it spiked at 140%, and after the
> IO wait dropped back to "normal" the 1min/5min/15min CPU load spiked to
> about 3 times the number of cores for a bit.
>
> We were at "peak" operation, i.e. we were running a batch job when this
> happened. During average operation the "business" requests per second from
> our services are about 15 RPS; when we run batches we can hit 600 RPS for a
> few hours and then drop back down. Each business request underneath does a
> few round trips back and forth between Kafka, cache systems, Flink, DBs,
> etc., so the Flink jobs are a subset of that 600 RPS.
>
> On the Flink side we have 3 task managers with 4 cores / 8GB each,
> configured as 8 slots, 5.4GB JVM, and 3.77GB Flink managed memory per task
> manager. We have 8 jobs and 9 slots free, so the cluster isn't full yet,
> but we do see that one node is full.
>
> We use disk FS state (backed by GlusterFS), not RocksDB. We had enabled
> 5 second checkpointing for 6 of the jobs... so I'm just wondering if that
> was possibly the reason for the IO wait. But regardless of the RPS
> mentioned above, the jobs will always checkpoint every 5 seconds... I had
> the chance to increase the checkpoint interval for a few of the jobs before
> the holidays. I am back on Monday...
>
> On Fri., Jan. 3, 2020, 11:16 a.m. Chesnay Schepler, wrote:
>
>> The logs show 2 interesting pieces of information:
>>
>> <tasks are submitted>
>> ...
>> 2019-12-19 18:33:23,278 INFO  org.apache.kafka.clients.FetchSessionHandler
>>   - [Consumer clientId=consumer-4, groupId=ccccccdb-prod-import] Error
>>   sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 0:
>>   org.apache.kafka.common.errors.DisconnectException.
>> ...
>> 2019-12-19 19:37:06,732 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor
>>   - Could not resolve ResourceManager address
>>   akka.tcp://flink@xxxxxx-job-0002:36835/user/resourcemanager, retrying in
>>   10000 ms: Ask timed out on
>>   [ActorSelection[Anchor(akka.tcp://flink@xxxxxx-job-0002:36835/),
>>   Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message
>>   of type "akka.actor.Identify"..
>>
>> This reads like the machine lost network connectivity for some reason.
>> The tasks start failing because Kafka cannot be reached, and the TM then
>> shuts down because it cannot reach the ResourceManager either.
>>
>> On 25/12/2019 04:34, Zhijiang wrote:
>>
>> If you use the RocksDB state backend, it might consume extra native
>> memory. Some resource frameworks, such as YARN, will kill the container if
>> the memory usage exceeds some threshold. You can also double check whether
>> that applies in your case.
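[For reference, a minimal sketch (Flink 1.8 style API, illustrative values only) of the kind of job configuration discussed in this thread: a filesystem state backend on a shared mount rather than RocksDB, plus a fixed checkpoint interval. The GlusterFS path and the interval below are placeholders, not values taken from this cluster.]

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Filesystem (non-RocksDB) state backend writing to a shared mount such as
        // GlusterFS. The URI is a placeholder, not this cluster's actual mount.
        env.setStateBackend(new FsStateBackend("file:///mnt/glusterfs/flink/checkpoints"));

        // The 5 second interval described in the thread; every checkpoint writes
        // state and metadata to the shared filesystem, so a short interval keeps
        // the mount under continuous write load.
        env.enableCheckpointing(5_000L);

        // ... sources, operators and sinks would be defined here ...
        // env.execute("example-job");
    }
}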
>>
>> ------------------------------------------------------------------
>> From: John Smith
>> Send Time: 2019 Dec. 25 (Wed.) 03:40
>> To: Zhijiang
>> Cc: user
>> Subject: Re: Flink task node shut itself off.
>>
>> The shutdown happened after the massive IO wait. I don't use any state;
>> checkpoints are disk based...
>>
>> On Mon., Dec. 23, 2019, 1:42 a.m. Zhijiang, wrote:
>>
>> Hi John,
>>
>> Thanks for the positive comments on Flink usage. Whether you use
>> at-least-once or exactly-once for checkpointing, it will never lose a
>> message during failure recovery.
>>
>> Unfortunately I cannot access the logs you posted. Generally speaking, a
>> longer checkpoint interval means replaying more source data after failure
>> recovery.
>> In my experience a 5 second checkpoint interval is too frequent, and you
>> might increase it to 1 minute or so. You can also monitor how long
>> checkpoints take to complete in your application and then adjust the
>> interval accordingly.
>>
>> Concerning the node shutdown you mentioned, I am not quite sure whether
>> it is related to your short checkpoint interval. Are you configured to use
>> the heap state backend? The hs_err file indicates that your job ran into a
>> memory issue, so it would be better to increase your task manager memory.
>> But if you can analyze the hs_err dump file with a profiling tool to check
>> the memory usage, that might be more helpful for finding the root cause.
>>
>> Best,
>> Zhijiang
>>
>> ------------------------------------------------------------------
>> From: John Smith
>> Send Time: 2019 Dec. 21 (Sat.) 05:26
>> To: user
>> Subject: Flink task node shut itself off.
>>
>> Hi, using Flink 1.8.0.
>>
>> First off I must say Flink resiliency is very impressive; we lost a node
>> and never lost one message by using checkpoints and Kafka. Thanks!
>>
>> The cluster is self hosted and we use our own ZooKeeper cluster. We have...
>> 3 zookeepers: 4 cpu, 8GB (each)
>> 3 job nodes: 4 cpu, 8GB (each)
>> 3 task nodes: 4 cpu, 8GB (each)
>> The nodes also share GlusterFS for storing savepoints and checkpoints;
>> GlusterFS is running on the same machines.
>>
>> Yesterday a node shut itself off with the following log messages...
>> - Stopping TaskExecutor akka.tcp://flink@xxx.xxx.xxx.73:34697/user/taskmanager_0.
>> - Stop job leader service.
>> - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>> - Shutting down TaskExecutorLocalStateStoresManager.
>> - Shutting down BLOB cache
>> - Shutting down BLOB cache
>> - removed file cache directory /tmp/flink-dist-cache-4b60d79b-1cef-4ffb-8837-3a9c9a205000
>> - I/O manager removed spill file directory /tmp/flink-io-c9d01b92-2809-4a55-8ab3-6920487da0ed
>> - Shutting down the network environment and its components.
>>
>> Prior to the node shutting off we noticed a massive IOWAIT of 140% and a
>> 1 minute CPU load of 15. We also got an hs_err file which says we should
>> increase the memory.
>>
>> I'm attaching the logs here:
>> https://www.dropbox.com/sh/vp1ytpguimiayw7/AADviCPED47QEy_4rHsGI1Nya?dl=0
>>
>> I wonder if my 5 second checkpointing is too much for Gluster.
>>
>> Any thoughts?
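[As a rough sketch of the adjustment suggested above (a longer checkpoint interval plus a minimum pause, so slow checkpoints on the shared filesystem cannot pile up), assuming the standard Flink 1.8 CheckpointConfig API; apart from the "1 minute or so" mentioned in the thread, the values are placeholders, not recommendations from the participants.]

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointIntervalSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Roughly the "1 minute or so" interval suggested in the reply above.
        env.enableCheckpointing(60_000L);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();

        // Abort a checkpoint that takes longer than 2 minutes (placeholder value).
        checkpointConfig.setCheckpointTimeout(120_000L);

        // Require a pause between the end of one checkpoint and the start of the
        // next, so slow checkpoints on the shared mount cannot run back to back.
        checkpointConfig.setMinPauseBetweenCheckpoints(30_000L);

        // Checkpoint duration and size can then be watched in the web UI's
        // "Checkpoints" tab and the interval tuned accordingly.
        // env.execute("example-job");
    }
}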