From: Eleanore Jin
Date: Wed, 29 Apr 2020 16:40:53 -0700
Subject: Flink Task Manager GC overhead limit exceeded
To: user, user-zh

Hi All,

Currently I am running a Flink job cluster (v1.8.2) on Kubernetes with 4 pods, each pod with a parallelism of 4.

The Flink job reads from a source topic with 96 partitions and applies a per-element filter. The filter value comes from a broadcast topic, and the job always uses the latest message as the filter criterion. It then publishes to a sink topic.

There is no checkpointing or state involved.

I am seeing "GC overhead limit exceeded" errors continuously, and the pods keep restarting.

So I tried to increase the heap size for the task manager with:

containers:
      - args:
        - task-manager
        - -Djobmanager.rpc.address=service-job-manager
        - -Dtaskmanager.heap.size=4096m
        - -Denv.java.opts.taskmanager="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/oom.bin"

Three things I noticed:

1. The task manager heap size is not shown correctly in the UI.
[image: image.png]
2. I don't see the heap dump file /dumps/oom.bin in the restarted pod. Did I set the Java opts wrong?
3. I continuously see the logs below from all pods, and I am not sure whether they cause any issue:

{"@timestamp":"2020-04-29T23:39:43.387Z","@version":"1","message":"[Consumer clientId=consumer-1, groupId=aba774bc] Node 6 was unable to process the fetch request with (sessionId=2054451921, epoch=474): FETCH_SESSION_ID_NOT_FOUND.","logger_name":"org.apache.kafka.clients.FetchSessionHandler","thread_name":"pool-6-thread-1","level":"INFO","level_value":20000}

Thanks a lot for any help!

Best,
Eleanore
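
To gauge how often the FETCH_SESSION_ID_NOT_FOUND message from point 3 occurs and which Kafka nodes it involves, the JSON log lines can be tallied per node. The following is a minimal Python sketch and not part of the original message. The function name `tally_fetch_session_errors` and the regular expression are assumptions based only on the single log line quoted in the message, so other log layouts may need a different pattern.

```python
import json
import re
import sys
from collections import Counter

# Assumption: the node id appears in the message text exactly as in the
# one sample line quoted in the email ("Node 6 was unable to process ...").
NODE_RE = re.compile(r"Node (\d+) was unable to process the fetch request")


def tally_fetch_session_errors(lines):
    """Count FETCH_SESSION_ID_NOT_FOUND log messages per Kafka node id."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON log records
        if not isinstance(record, dict):
            continue
        message = record.get("message", "")
        if "FETCH_SESSION_ID_NOT_FOUND" not in message:
            continue
        match = NODE_RE.search(message)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Read log lines from standard input and print one count per node.
    for node, count in sorted(tally_fetch_session_errors(sys.stdin).items()):
        print(f"node {node}: {count}")
```

If the script is saved under a hypothetical name such as `tally.py`, the logs of one pod could be piped into it, for example with `kubectl logs <pod-name> | python tally.py`. The counts only show how frequent the message is per node; they do not by themselves say whether it is related to the GC errors.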