flink-user mailing list archives

From Tony Wei <tony19920...@gmail.com>
Subject Re: Get EOF from PrometheusReporter in JM
Date Sat, 23 Sep 2017 11:11:54 GMT
Hi Chesnay,

I built another Flink cluster using version 1.4 and set the log level to
DEBUG. The root cause appears to be this exception:
Value returned by gauge lastCheckpointExternalPath was null.

I updated `CheckpointStatsTracker` to ignore the external path when it is null,
and the exception no longer occurs. The Prometheus reporter now works as
expected.
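
For illustration only, here is a minimal sketch of the kind of null guard described above. It is not the actual change to `CheckpointStatsTracker`. The `Gauge` interface below is a hypothetical stand-in for Flink's own gauge type so that the example compiles without Flink, and the class name `NullSafeStringGauge` and the "n/a" placeholder are assumptions made for this sketch.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for Flink's gauge interface, so the sketch compiles on its own.
interface Gauge<T> {
    T getValue();
}

// Hypothetical wrapper: reports a placeholder instead of null, so a reporter
// that cannot handle null gauge values never sees one.
final class NullSafeStringGauge implements Gauge<String> {
    // Assumed placeholder; any non-null string would serve the same purpose.
    static final String PLACEHOLDER = "n/a";

    private final Supplier<String> source;

    NullSafeStringGauge(Supplier<String> source) {
        this.source = source;
    }

    @Override
    public String getValue() {
        String value = source.get();
        return value != null ? value : PLACEHOLDER;
    }
}

final class NullSafeGaugeDemo {
    public static void main(String[] args) {
        // A checkpoint without an external path: the raw value would be null.
        Gauge<String> noPath = new NullSafeStringGauge(() -> null);
        // A checkpoint with an external path (made-up value for the demo).
        Gauge<String> withPath = new NullSafeStringGauge(() -> "hdfs:///checkpoints/chk-1");

        System.out.println(noPath.getValue());
        System.out.println(withPath.getValue());
    }
}
```

An alternative design would be to skip registering the gauge until a non-null path exists; the wrapper approach is shown here only because it keeps the metric name stable.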

I have created a Jira issue for it:
<https://issues.apache.org/jira/browse/FLINK-7675>, and I will submit the
PR once Travis CI passes on my repository.

Best Regards,
Tony Wei

2017-09-22 22:20 GMT+08:00 Tony Wei <tony19920430@gmail.com>:

> Hi Chesnay,
> I haven't tried it in 1.4, so I don't know whether this also occurs there.
> As for logging, it was already set to INFO level, but there
> wasn't any error or warning in the log file either.
> Best Regards,
> Tony Wei
> 2017-09-22 22:07 GMT+08:00 Chesnay Schepler <chesnay@apache.org>:
>> The Prometheus reporter should work with 1.3.2.
>> Does this also occur with the reporter that currently exists in 1.4? (to
>> rule out new bugs from the PR).
>> To investigate this further, please set the logging level to WARN and try
>> again, as all errors in the metric system are logged on that level.
>> On 22.09.2017 10:33, Tony Wei wrote:
>> Hi,
>> I have built the Prometheus reporter package from this PR
>> https://github.com/apache/flink/pull/4586, and used it on Flink 1.3.2 to
>> record all default metrics and those from `FlinkKafkaConsumer`.
>> Originally, everything was fine. I could get those metrics for the TM from
>> Prometheus, just as I saw them in the Flink Web UI.
>> However, when I turned to the JM, Prometheus gave me this error: Get
>> http://localhost:9249/metrics: EOF.
>> I checked the log on the JM and saw nothing in it. There was no error message
>> and port 9249 was still open.
>> To figure out what happened, I created another cluster and found that
>> Prometheus could connect to the Flink cluster as long as no job was running.
>> After the JM triggered or completed the first checkpoint, Prometheus started
>> getting ERR_EMPTY_RESPONSE from the JM, but not from the TM. There was still
>> no error in the log file and port 9249 was still open.
>> I was wondering where the error occurs: in Flink or in the Prometheus reporter?
>> Or is it incorrect to use the Prometheus reporter on Flink 1.3.2? Thank you.
>> Best Regards,
>> Tony Wei
