kafka-jira mailing list archives

From "Lukasz Mierzwa (JIRA)" <j...@apache.org>
Subject [jira] [Created] (KAFKA-6436) Provide a metric indicating broker cluster membership state
Date Wed, 10 Jan 2018 04:58:03 GMT
Lukasz Mierzwa created KAFKA-6436:

             Summary: Provide a metric indicating broker cluster membership state
                 Key: KAFKA-6436
                 URL: https://issues.apache.org/jira/browse/KAFKA-6436
             Project: Kafka
          Issue Type: Wish
          Components: metrics
            Reporter: Lukasz Mierzwa
            Priority: Minor

When deploying Kafka config changes, each instance needs to be restarted (since there's no
graceful reload), and that requires coordination to keep all partitions online. Part of my
automation waits after restarting each instance until the restarted broker is back in sync
on all partitions. To do that I query for:

kafka.server:name=BrokerState,type=KafkaServer to be 3 (broker is up & running)
kafka.server:clientId=Replica,name=MaxLag,type=ReplicaFetcherManager = 0 (there's no lag)

I've noticed that there's a race for the MaxLag metric: when the replica fetcher threads are
starting, this metric is initialized with a value of 0, then (I assume) once all threads connect
to the leaders it's populated with the "correct" MaxLag value computed from all those threads.
This means that there's a window where I can query for those metrics and get the expected BrokerState=3
and MaxLag=0, which I would interpret as "done restarting this instance", but a few seconds
later MaxLag might jump to a huge value.
Right now my workaround is to require multiple queries to return expected metric values, which
seems to protect me from hitting that window.
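
The workaround above could be sketched roughly as follows. This is only an illustration, not the reporter's actual automation: `read_metrics` is a hypothetical callable that returns the current (BrokerState, MaxLag) pair from whatever JMX bridge is in use, and the number of consecutive reads, the interval and the attempt limit are arbitrary assumptions.

```python
import time
from typing import Callable, Tuple

RUNNING = 3  # BrokerState value described above as "broker is up & running"


def wait_until_stable(
    read_metrics: Callable[[], Tuple[int, int]],  # hypothetical: returns (BrokerState, MaxLag)
    required_consecutive: int = 3,   # assumed value, tune to cover the race window
    interval_s: float = 5.0,         # assumed polling interval
    max_attempts: int = 120,         # assumed upper bound before giving up
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the expected values were seen on several reads in a row."""
    streak = 0
    for _ in range(max_attempts):
        broker_state, max_lag = read_metrics()
        if broker_state == RUNNING and max_lag == 0:
            streak += 1
            if streak >= required_consecutive:
                return True
        else:
            # Any unexpected reading (e.g. MaxLag jumping up) resets the streak.
            streak = 0
        sleep(interval_s)
    return False
```

Requiring a streak only narrows the window; it does not close it, since nothing guarantees the fetcher threads have connected within the polling period.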
It would be nice if there was a metric like "ClusterState", initialized as 0, that would be
set to 1 only once all replica fetcher threads have started, finished reconnecting to the
leaders, and the proper MaxLag is set (or there are no replicas on the given broker).
Alternatively, MaxLag could just be initialized with -1 and set to 0 later if that's the actual
max lag computed after getting replication offsets from the leaders (if that would work).
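
To illustrate how a client could use the -1 alternative, here is a minimal sketch. The sentinel is only the proposal from this message, not existing Kafka behaviour, and the function name and arguments are hypothetical.

```python
RUNNING = 3          # BrokerState value described above as "broker is up & running"
UNINITIALIZED = -1   # proposed sentinel: MaxLag not yet computed (hypothetical)


def restart_complete(broker_state: int, max_lag: int) -> bool:
    """Treat the restart as done only when MaxLag is a real, computed zero."""
    if broker_state != RUNNING:
        return False
    if max_lag == UNINITIALIZED:
        # Fetcher threads have not reported yet, so the lag is unknown.
        return False
    return max_lag == 0
```

With such a sentinel a single query would be enough, because an uninitialized MaxLag could no longer be mistaken for "no lag".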

If there was a "ClusterState" metric, it could also be used to signal that a broker has lost connectivity
with the rest of the cluster. I don't think there is such a metric right now (is there?).
