flink-user mailing list archives

From jelmer <jkupe...@gmail.com>
Subject Re: TaskIOMetricGroup metrics not unregistered in prometheus on job failure ?
Date Tue, 26 Jun 2018 15:28:43 GMT
Hi Chesnay, sorry for the late reply. I did not have time to look into this
sooner.

I did what you suggested and added some logging to the PrometheusReporter,
like this:

https://github.com/jelmerk/flink/commit/58779ee60a8c3961f3eb2c487c603c33822bba8a

And deployed a custom build of the reporter to our test environment.
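
The change boils down to something like the following (a paraphrased
sketch, not the exact diff; see the commit above for the real change. The
callback signatures are those of Flink's MetricReporter interface):

    import org.apache.flink.metrics.Metric;
    import org.apache.flink.metrics.MetricGroup;
    import org.apache.flink.metrics.prometheus.PrometheusReporter;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Logs every add/remove notification before delegating to the
    // stock PrometheusReporter.
    public class LoggingPrometheusReporter extends PrometheusReporter {
        private static final Logger LOG =
                LoggerFactory.getLogger(LoggingPrometheusReporter.class);

        @Override
        public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
            LOG.info("Adding metric {}", group.getMetricIdentifier(metricName));
            super.notifyOfAddedMetric(metric, metricName, group);
        }

        @Override
        public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
            LOG.info("Removing metric {}", group.getMetricIdentifier(metricName));
            super.notifyOfRemovedMetric(metric, metricName, group);
        }
    }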

I managed to reproduce the issue like this:

1. Deploy job A: it lands on worker 1.
2. Deploy job B: it also lands on worker 1; take note of its job id.
3. Redeploy job B by canceling it with a savepoint and deploying it again
from the savepoint: it lands on worker 3.
4. On worker 1, execute curl -s http://localhost:9249/metrics | grep "job id
from step 2". The metrics are still exposed even though the job was
canceled.

I attached a piece of the log to the email. What I notice is that the two
jobs register metrics under the same scoped metric name, in this case
flink_taskmanager_job_task_buffers_inputQueueLength.

The Prometheus exporter seems to use reference counting for the metrics,
and a metric is only removed when its count drops to 0. Canceling job B
lowers the counter by 5, but because job A is still deployed the count
never reaches 0, so the metric never gets unregistered.

Canceling job A will remove the lingering metrics from the old job B.
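
That matches a simple reference-counting model like the following (an
illustrative sketch; the field and method names are made up for the
example, not the reporter's actual internals):

    import java.util.HashMap;
    import java.util.Map;

    // Models the observed behavior: one counter per scoped metric name,
    // shared by every job that produces that name.
    class ScopedCollectorRefCounts {
        private final Map<String, Integer> counts = new HashMap<>();

        void metricAdded(String scopedName) {
            // job A and job B both bump the same counter because they
            // share the scoped name; only the label values (the child)
            // differ per job
            counts.merge(scopedName, 1, Integer::sum);
        }

        void metricRemoved(String scopedName) {
            int remaining = counts.merge(scopedName, -1, Integer::sum);
            if (remaining == 0) {
                // only now is the collector unregistered; as long as job A
                // keeps the count above 0, job B's stale time series stay
                // exposed on job B's old worker
                counts.remove(scopedName);
            }
        }
    }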

It seems to me that this is a bug: the children that are being added
in notifyOfAddedMetric
<https://github.com/jelmerk/flink/commit/58779ee60a8c3961f3eb2c487c603c33822bba8a#diff-36ff6f170e359d30a1265b43659443bfR163>
should be removed again in notifyOfRemovedMetric.
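
Extending the sketch above, the fix I have in mind would drop the per-job
child as soon as its metric goes away, and only unregister the whole
collector once the count reaches 0 (again a paraphrased sketch, not a
tested patch; io.prometheus.client.SimpleCollector does offer
remove(String... labelValues) for exactly this):

    import java.util.HashMap;
    import java.util.Map;

    import io.prometheus.client.Gauge;

    // Sketch of the suggested fix: remove the canceled job's child right
    // away instead of waiting for the shared reference count to hit 0.
    class ChildRemovingRefCounts {
        private final Map<String, Gauge> collectorsByScopedName = new HashMap<>();
        private final Map<String, Integer> counts = new HashMap<>();

        void metricRemoved(String scopedName, String[] labelValues) {
            Gauge collector = collectorsByScopedName.get(scopedName);
            if (collector != null) {
                // removes just this job's time series from the collector
                collector.remove(labelValues);
            }
            int remaining = counts.merge(scopedName, -1, Integer::sum);
            if (remaining == 0) {
                counts.remove(scopedName);
                collectorsByScopedName.remove(scopedName);
                // unregister the now-empty collector from
                // CollectorRegistry.defaultRegistry here
            }
        }
    }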

Can you confirm this?


--Jelmer



On Fri, 15 Jun 2018 at 18:01, Chesnay Schepler <chesnay@apache.org> wrote:

> I remember that another user reported something similar, but he wasn't
> using the PrometheusReporter. see
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/JVM-metrics-disappearing-after-job-crash-restart-tt20420.html
>
> We couldn't find the cause, but my suspicion was FLINK-8946, which will be
> fixed in 1.4.3.
> You could cherry-pick 8b046fafb6ee77a86e360f6b792e7f73399239bd and see
> whether this actually caused it.
>
> Alternatively, if you can reproduce this it would be immensely helpful if
> you could modify the PrometheusReporter and log all notifications about
> added or removed metrics.
>
> On 15.06.2018 15:42, Till Rohrmann wrote:
>
> Hi,
>
> this sounds very strange. I just tried it out locally with a standard
> metric, and the Prometheus metrics seem to be unregistered after the job
> has reached a terminal state. Thus, it looks as if the standard metrics are
> properly removed from `CollectorRegistry.defaultRegistry`. Could you check
> whether the log files contain anything suspicious about a failed metric
> deregistration, à la `There was a problem unregistering metric`?
>
> I've also pulled in Chesnay who knows more about the metric reporters.
>
> Cheers,
> Till
>
> On Thu, Jun 14, 2018 at 11:34 PM jelmer <jkuperus@gmail.com> wrote:
>
>> Hi
>>
>> We are using flink-metrics-prometheus for reporting on Apache Flink 1.4.2,
>> and I am looking into an issue where, in some cases, the metrics
>> registered by org.apache.flink.runtime.metrics.groups.TaskIOMetricGroup
>> (flink_taskmanager_job_task_buffers_outPoolUsage etc.) are not being
>> unregistered in Prometheus after a job restart.
>>
>> Eventually this seems to cause a java.lang.NoClassDefFoundError:
>> org/apache/kafka/common/metrics/stats/Rate$1 when a new version of the
>> job is deployed, because the jar file
>> in /tmp/blobStore-foo/job_bar/blob_p-baz-qux is removed upon deployment
>> of the new job, but the URL classloader still points to it and cannot
>> find stats/Rate$1 (a synthetic class generated by the Java compiler
>> because of a switch on an enum).
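>>
>> For illustration, this is the kind of pattern that produces such a
>> synthetic class (my own minimal example, not Kafka's actual code); the
>> compiler emits Rate$1 to hold the switch lookup table:
>>
>>     public class Rate {
>>         enum Unit { SECONDS, MINUTES }
>>
>>         static long toMillis(Unit unit) {
>>             // javac compiles this switch via a synthetic class (Rate$1)
>>             // holding an int[] that maps enum ordinals to cases; since it
>>             // is loaded lazily, the NoClassDefFoundError only surfaces
>>             // after the jar behind the classloader is gone
>>             switch (unit) {
>>                 case SECONDS: return 1_000L;
>>                 case MINUTES: return 60_000L;
>>                 default: throw new IllegalArgumentException(unit.toString());
>>             }
>>         }
>>     }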
>>
>> Has anybody come across this issue? Has it possibly been fixed in 1.5?
>> Can somebody give any pointers as to where to look to tackle this?
>>
>> The attached screenshot shows the classloader that cannot be garbage
>> collected, together with its GC root.
>>
>>
>
