References:
From: Tony Wei
Date: Sat, 23 Sep 2017 19:11:54 +0800
Message-ID:
Subject: Re: Get EOF from PrometheusReporter in JM
To: Chesnay Schepler
Cc: user
Content-Type: multipart/alternative; boundary="001a114a8496a1d0100559d964c6"
archived-at: Sat, 23 Sep 2017 11:12:04 -0000

--001a114a8496a1d0100559d964c6
Content-Type: text/plain; charset="UTF-8"

Hi Chesnay,

I built another Flink cluster using version 1.4 and set the log level to
DEBUG, and I found that the root cause might be this exception:
*java.lang.NullPointerException: Value returned by gauge
lastCheckpointExternalPath was null*.

I updated `CheckpointStatsTracker` to ignore the external path when it is
null, and the exception did not happen again. The Prometheus reporter works
as well.

I have created a Jira issue for it:
https://issues.apache.org/jira/browse/FLINK-7675, and I will submit the PR
once it passes Travis CI in my repository.

Best Regards,
Tony Wei

2017-09-22 22:20 GMT+08:00 Tony Wei <tony19920430@gmail.com>:

> Hi Chesnay,
>
> I didn't try it in 1.4, so I have no idea whether this also occurs in 1.4.
> As for my logging setup, it was already set to INFO level, but there
> wasn't any error or warning in the log file either.
>
> Best Regards,
> Tony Wei
>
> 2017-09-22 22:07 GMT+08:00 Chesnay Schepler <chesnay@apache.org>:
>
>> The Prometheus reporter should work with 1.3.2.
>>
>> Does this also occur with the reporter that currently exists in 1.4? (to
>> rule out new bugs from the PR).
>>
>> To investigate this further, please set the logging level to WARN and try
>> again, as all errors in the metric system are logged on that level.
>>
>>
>> On 22.09.2017 10:33, Tony Wei wrote:
>>
>> Hi,
>>
>> I have built the Prometheus reporter package from this PR
>> https://github.com/apache/flink/pull/4586, and used it on Flink 1.3.2 to
>> record all default metrics and those from `FlinkKafkaConsumer`.
>>
>> Originally, everything was fine. I could get those metrics for the TM from
>> Prometheus, just as I saw them in the Flink Web UI.
>> However, when I turned to the JM, Prometheus gave me this error: Get
>> http://localhost:9249/metrics: EOF.
>> I checked the log on the JM and saw nothing in it. There was no error
>> message, and port 9249 was still alive.
>>
>> To figure out what happened, I created another cluster and found that
>> Prometheus could connect to the Flink cluster as long as there was no
>> running job. After the JM triggered or completed the first checkpoint,
>> Prometheus started getting ERR_EMPTY_RESPONSE from the JM, but not from
>> the TM. There was still no error in the log file, and port 9249 was still
>> alive.
>>
>> I was wondering where the error occurred: in Flink or in the Prometheus
>> reporter? Or is it incorrect to use the Prometheus reporter on Flink
>> 1.3.2? Thank you.
>>
>> Best Regards,
>> Tony Wei
>>
>>
>

--001a114a8496a1d0100559d964c6
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

--001a114a8496a1d0100559d964c6--
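
The failure reported in this thread (a gauge returning null, and a reporter that rejects null values) can be illustrated with a small standalone sketch. It does not use Flink's classes: the `Gauge` interface, the `render` and `orDefault` helpers, and the "n/a" placeholder are all hypothetical names chosen for illustration, and this is not the change made in `CheckpointStatsTracker` or in FLINK-7675. It only shows why a null value can break a strict consumer and how a null-safe wrapper avoids it.

```java
import java.util.Objects;

public class NullSafeGaugeSketch {

    /** Hypothetical stand-in for a metrics gauge: one method returning the current value. */
    interface Gauge<T> {
        T getValue();
    }

    /** Mimics a strict reporter that refuses null gauge values (assumed behaviour). */
    static String render(String name, Gauge<?> gauge) {
        Object value = Objects.requireNonNull(
                gauge.getValue(), "Value returned by gauge " + name + " was null");
        return name + " " + value;
    }

    /** Wraps a gauge so that a null value is replaced by a placeholder. */
    static <T> Gauge<T> orDefault(Gauge<T> delegate, T placeholder) {
        return () -> {
            T value = delegate.getValue();
            return value != null ? value : placeholder;
        };
    }

    public static void main(String[] args) {
        // Assumption: no external checkpoint path is available, so the gauge yields null.
        Gauge<String> externalPath = () -> null;

        try {
            render("lastCheckpointExternalPath", externalPath);
        } catch (NullPointerException e) {
            System.out.println("strict reporter failed: " + e.getMessage());
        }

        // The wrapped gauge never returns null, so the strict reporter accepts it.
        System.out.println(
                render("lastCheckpointExternalPath", orDefault(externalPath, "n/a")));
    }
}
```

Whether to substitute a placeholder or to skip the metric entirely is a design choice; the sketch picks the placeholder only because it keeps the metric visible to the reporter.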