From user-return-31487-archive-asf-public=cust-asf.ponee.io@flink.apache.org Tue Dec 17 02:07:31 2019
Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Delivered-To: mailing list user@flink.apache.org
MIME-Version: 1.0
References: <55EC5C2E-CA5B-46F3-9F12-CD83AB4EC611@comcast.com>
From: Zhu Zhu
Date: Tue, 17 Dec 2019 10:06:23 +0800
Subject: Re: [EXTERNAL] Flink and Prometheus monitoring question
To:
 Jesús Vásquez
Cc: "PoolakkalMukkath, Shakir", user
Content-Type: multipart/alternative; boundary="000000000000c4245f0599dcc9c0"

--000000000000c4245f0599dcc9c0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit

Hi Jesús,

If your job has checkpointing enabled, you can monitor
'numberOfCompletedCheckpoints' to see whether the job is still alive and
healthy.

Thanks,
Zhu Zhu

On Tue, Dec 17, 2019 at 2:43 AM, Jesús Vásquez wrote:

> The thing about the numRunningJobs metric is that I have to configure the
> Prometheus rules in advance with the number of jobs I expect to be running
> in order to alert, and I need this rule to alert on individual jobs. I
> initially thought of flink_jobmanager_downtime{job_id=~".*"} == -1, but it
> turned out that the metric just emits 0 for running jobs and doesn't emit
> -1 for failed jobs.
>
> On Mon, Dec 16, 2019 at 7:01 PM, PoolakkalMukkath, Shakir <
> Shakir_PoolakkalMukkath@comcast.com> wrote:
>
>> You could use "flink_jobmanager_numRunningJobs" to check the number of
>> running jobs.
>>
>> Thanks
>>
>> *From: *Jesús Vásquez
>> *Date: *Monday, December 16, 2019 at 12:47 PM
>> *To: *"user@flink.apache.org"
>> *Subject: *[EXTERNAL] Flink and Prometheus monitoring question
>>
>> Hi,
>>
>> I want to monitor Flink Streaming jobs using Prometheus.
>>
>> My first goal is to send alerts when a Flink job has failed.
>>
>> The thing is that, looking at the documentation, I haven't found a metric
>> that helps me define an alerting rule.
>>
>> As a starting point I thought that the metric
>> flink_jobmanager_job_downtime could help, since the doc says this metric
>> emits -1 for a completed job.
>>
>> But when I tested this I found that it doesn't work: the metric always
>> emits 0, and after the job is completed there is no metric at all.
>>
>> Has anyone managed to alert with Prometheus when a Flink job has failed?
>>
>> Thanks for your help.
>>
>

--000000000000c4245f0599dcc9c0
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--000000000000c4245f0599dcc9c0--
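
To make the checkpoint suggestion in this thread concrete, here is a minimal sketch in Python of the per-job check it implies: a job is treated as unhealthy when its completed-checkpoint counter has not increased within a time window, or when it has stopped reporting samples. Everything here is an assumption for illustration and not something confirmed in the thread: the full metric name flink_jobmanager_job_numberOfCompletedCheckpoints (the name the Prometheus reporter is assumed to expose), the helper names stalled_jobs and stalled_expr, and the sample layout (job ID mapped to a list of timestamp/value pairs).

```python
from typing import Dict, List, Tuple

# Assumed Prometheus name for Flink's numberOfCompletedCheckpoints metric;
# verify it against what your reporter actually exposes.
METRIC = "flink_jobmanager_job_numberOfCompletedCheckpoints"

# One sample: (unix timestamp in seconds, counter value).
Sample = Tuple[float, float]


def stalled_jobs(series: Dict[str, List[Sample]], now: float, window_s: float) -> List[str]:
    """Return the sorted job IDs that look unhealthy within the window.

    A job is flagged when it has fewer than two samples in the window
    (it may have stopped reporting) or when its counter did not increase
    between the oldest and newest sample in the window.
    """
    flagged = []
    for job_id, samples in series.items():
        # Keep only samples inside [now - window_s, now], ordered by time.
        recent = sorted((t, v) for t, v in samples if now - window_s <= t <= now)
        if len(recent) < 2:
            flagged.append(job_id)
            continue
        oldest_value = recent[0][1]
        newest_value = recent[-1][1]
        if newest_value <= oldest_value:
            flagged.append(job_id)
    return sorted(flagged)


def stalled_expr(window: str = "10m") -> str:
    """Build a PromQL expression that matches jobs with no new checkpoints.

    Hypothetical helper: it only formats a string around PromQL's
    increase() function and does not talk to Prometheus.
    """
    return f"increase({METRIC}[{window}]) == 0"


if __name__ == "__main__":
    # Illustrative data only: job "a" progresses, "b" is stuck, "c" went quiet.
    example = {
        "a": [(0, 5), (60, 6), (120, 7)],
        "b": [(0, 3), (60, 3), (120, 3)],
        "c": [(0, 1)],
    }
    print(stalled_jobs(example, now=120, window_s=120))
    print(stalled_expr("5m"))
```

Because the check runs per job ID, it does not need the expected number of running jobs to be configured in advance. Note that a brand-new job with a single sample is also flagged by this sketch, and that a PromQL comparison returns nothing for a series that no longer exists, so the "metric disappears after the job ends" case described in the thread would likely need a separate check.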