ignite-dev mailing list archives

From Nikolay Izhikov <nizhi...@apache.org>
Subject Re: Partition map exchange metrics
Date Thu, 25 Jul 2019 16:24:14 GMT
Pavel,

Have you had a chance to look at the HistogramMetric source?
It is in master now.
Looking at the source would be better than my explanation :)

We should count the PME processes that block operations for a given amount of
time, for example in the buckets [less than 50, less than 250, less than 1000,
more than 1000] milliseconds.
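
The bucket scheme above can be sketched roughly as follows. This is only an
illustration and not the HistogramMetric class that is in master: the class
name BlockingDurationHistogram and its methods are hypothetical, the bounds
are assumed to be sorted in ascending order, and putting a duration that
equals a bound into the higher bucket is an assumption as well.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical sketch of a bucketed counter, not Ignite's HistogramMetric. */
class BlockingDurationHistogram {
    /** Bucket bounds in milliseconds, assumed to be sorted ascending. */
    private final long[] bounds;

    /** One counter per bound plus one overflow bucket for larger values. */
    private final AtomicLongArray counts;

    BlockingDurationHistogram(long... bounds) {
        this.bounds = bounds.clone();
        this.counts = new AtomicLongArray(bounds.length + 1);
    }

    /** Intended to be called once per blocking PME, when the blocking ends. */
    void record(long durationMillis) {
        int idx = 0;

        // Assumption: a duration equal to a bound goes to the next bucket.
        while (idx < bounds.length && durationMillis >= bounds[idx])
            idx++;

        counts.incrementAndGet(idx);
    }

    /** Returns a copy of the current per-bucket counts. */
    long[] snapshot() {
        long[] res = new long[counts.length()];

        for (int i = 0; i < res.length; i++)
            res[i] = counts.get(i);

        return res;
    }

    public static void main(String[] args) {
        BlockingDurationHistogram hist = new BlockingDurationHistogram(50, 250, 1000);

        for (long d : new long[] {10, 49, 50, 300, 5000})
            hist.record(d);

        System.out.println(Arrays.toString(hist.snapshot())); // prints [2, 1, 1, 1]
    }
}
```

With bounds 50, 250 and 1000 there are four counters, one per bucket in the
list above. A monitoring system could read them periodically and watch
whether the slow buckets grow.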

On Thu, 25 Jul 2019 at 18:55, Pavel Kovalenko <jokserfn@gmail.com> wrote:

> Nikolay,
>
> Could you please explain in more detail what the structure of the PME
> histogram will be?
>
> On Thu, 25 Jul 2019 at 11:56, Nikolay Izhikov <nizhikov@apache.org> wrote:
>
> > Hello, Nikita.
> >
> > I think
> >
> > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > all blocking durations that happen after node starts.
> >
> > No, we don't need it.
> >
> > > 2. Blocking duration histogram. Based on the HistogramMetric class.
> >
> > Yes, we need it.
> >
> > On Thu, 25/07/2019 at 11:50 +0300, Nikita Amelchev wrote:
> > > Igniters,
> > >
> > > Everyone wants to see the cacheOperationsBlockedDuration metric, which will
> > > show the current blocking duration, or 0 if there is no blocking right now.
> > >
> > > Do we need the following metrics? It seems one of them will be superfluous.
> > > 1. The totalCacheOperationsBlockedDuration metric, which will accumulate
> > > all blocking durations that happen after the node starts.
> > > 2. A blocking duration histogram, based on the HistogramMetric class.
> > > Users will be able to configure the bounds.
> > >
> > > On Wed, 24 Jul 2019 at 18:26, Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > >
> > > > Guys.
> > > >
> > > > I think we should go with these 2 metrics:
> > > >
> > > >         * current PME duration (resets on finish)
> > > >
> > > >                 This metric is required for alerting (or automatic
> > > >                 actions) on long PME.
> > > >
> > > >         * PME duration histogram (value added to metrics on PME finish)
> > > >
> > > >                 This metric is required for:
> > > >                         * Quick PME trend analysis
> > > >                         * Quick PME history analysis
> > > >
> > > >
> > > > On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> > > > > Nikita and Maxim,
> > > > >
> > > > > > What if we just update the behaviour of the current
> > > > > > getCurrentPmeDuration metric to show durations only for blocking PMEs?
> > > > > > Keep it as a long value and rename it to
> > > > > > getCacheOperationsBlockedDuration.
> > > > > >
> > > > > > No other changes will be required.
> > > > > >
> > > > > > WDYT?
> > > > >
> > > > > I agree with these two metrics. I also think that current
> > > > > getCurrentPmeDuration will become redundant.
> > > > >
> > > > > Anton,
> > > > >
> > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > "monitoring".
> > > > > > It should not be interesting for a real admin what phase of PME is in
> > > > > > progress and so on.
> > > > >
> > > > > PME is a mission-critical cluster process. I agree that there's a fine
> > > > > line between monitoring and debug here. However, it's not good to add
> > > > > monitoring capabilities only for the scenario when everything is alright.
> > > > > If PME really hangs, a *real admin* will be extremely interested in how
> > > > > to return the cluster to a working state. Metrics about stage completion
> > > > > times may really help here: e.g. if one specific node hasn't completed
> > > > > stage X while the rest of the cluster has, it can be a signal that this
> > > > > node should be killed.
> > > > >
> > > > > Of course, it's possible to build a monitoring system that extracts this
> > > > > information from logs, but:
> > > > > - It's more resource intensive, as it requires parsing logs all the time
> > > > > - It's less reliable, as log messages may change
> > > > >
> > > > > Best Regards,
> > > > > Ivan Rakov
> > > > >
> > > > > On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > > > > > Folks,
> > > > > >
> > > > > > +1 to Anton's post.
> > > > > >
> > > > > > What if we just update the behaviour of the current
> > > > > > getCurrentPmeDuration metric to show durations only for blocking PMEs?
> > > > > > Keep it as a long value and rename it to
> > > > > > getCacheOperationsBlockedDuration.
> > > > > >
> > > > > > No other changes will be required.
> > > > > >
> > > > > > WDYT?
> > > > > >
> > > > > > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > Nikolay,
> > > > > > >
> > > > > > > The cacheOperationsBlockedDuration metric will show the current blocking
> > > > > > > duration, or 0 if there is no blocking right now.
> > > > > > >
> > > > > > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > > > > > blocking durations that happen after the node starts.
> > > > > > >
> > > > > > > On Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > > > > > > Nikita
> > > > > > > >
> > > > > > > > What is the difference between those two metrics?
> > > > > > > >
> > > > > > > > On Wed, 24 Jul 2019 at 12:45, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Igniters, thanks for comments.
> > > > > > > > >
> > > > > > > > > From the discussion, it can be seen that we need only two metrics for now:
> > > > > > > > > - cacheOperationsBlockedDuration (long)
> > > > > > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > > > > >
> > > > > > > > > I will prepare a PR as soon as possible.
> > > > > > > > >
> > > > > > > > > On Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <arzamas123@mail.ru.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > +1 with Anton decisions.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov <av@apache.org>:
> > > > > > > > > > >
> > > > > > > > > > > Folks,
> > > > > > > > > > >
> > > > > > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > > > > > "monitoring".
> > > > > > > > > > > It should not be interesting for a real admin what phase of PME is in
> > > > > > > > > > > progress and so on.
> > > > > > > > > > > The metrics of interest are:
> > > > > > > > > > > - total blocked time (will be used for real SLA counting)
> > > > > > > > > > > - are we blocked right now (shows we have an SLA degradation right now)
> > > > > > > > > > > The duration of the current blocking period can easily be derived by any
> > > > > > > > > > > modern monitoring tool through regular checks.
> > > > > > > > > > > The first true value will mean "period start"; precision will be a result
> > > > > > > > > > > of the check frequency.
> > > > > > > > > > > Anyway, I'm OK with having the current metric presented as a long, where
> > > > > > > > > > > the long is a duration. I see no reason for it, but OK :)
> > > > > > > > > > >
> > > > > > > > > > > All other features you mentioned are useful for improving code or
> > > > > > > > > > > deployment, and can (should) be taken from logs at the analysis
> > > > > > > > > > > phase.
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <ivan.glukos@gmail.com> wrote:
> > > > > > > > > > > > Folks, let me step in.
> > > > > > > > > > > >
> > > > > > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > > > > > >
> > > > > > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for all
> > > > > > > > > > > > > updates and translations on a previous topology.
> > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > >
> > > > > > > > > > > > > When a new PME starts, all these metrics are reset.
> > > > > > > > > > > >
> > > > > > > > > > > > Every metric from Nikita's list looks useful and simple to implement.
> > > > > > > > > > > > I think that it would be better to change the format of metrics 4, 5, 6
> > > > > > > > > > > > and 7 a bit: we can keep only the difference between the time of the
> > > > > > > > > > > > previous event and the time of the corresponding event. Such metrics would
> > > > > > > > > > > > be easier to perceive: they answer specific questions such as "how much
> > > > > > > > > > > > time did partition release take?" or "how much time did awaiting the end
> > > > > > > > > > > > of the distributed phase take?".
> > > > > > > > > > > > Also, if the results of 4, 5, 6, 7 are exported to a monitoring system,
> > > > > > > > > > > > graphs will show how the times of different stages change from one PME to
> > > > > > > > > > > > another.
> > > > > > > > > > > >
> > > > > > > > > > > > > When PME causes no blocking, it's a good PME and I see no reason to have
> > > > > > > > > > > > > monitoring related to it
> > > > > > > > > > > >
> > > > > > > > > > > > Agree with Anton here. These metrics should be measured only for true
> > > > > > > > > > > > distributed exchange. Saving results for client leave/join PMEs will
> > > > > > > > > > > > just complicate monitoring.
> > > > > > > > > > > >
> > > > > > > > > > > > > I agree with the total blocking duration metric, but
> > > > > > > > > > > > > I still don't understand why the instant value indicating that operations
> > > > > > > > > > > > > are blocked should be a boolean.
> > > > > > > > > > > > > The duration since blocking started looks more appropriate and useful.
> > > > > > > > > > > > > It gives more information while the semantics stay the same.
> > > > > > > > > > > >
> > > > > > > > > > > > Totally agree with Pavel here. Both the "accumulated block time" and
> > > > > > > > > > > > "current PME block time" metrics are useful. Growth of the accumulated
> > > > > > > > > > > > metric over a specific period of time (should be easy to check via a
> > > > > > > > > > > > monitoring system graph) will show for how long business operations were
> > > > > > > > > > > > blocked in total, and a non-zero current metric will show that we are
> > > > > > > > > > > > experiencing issues right now. A boolean "are we blocked right now" metric
> > > > > > > > > > > > is not needed, as it can obviously be inferred from the "current PME block
> > > > > > > > > > > > time".
> > > > > > > > > > > >
> > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > Ivan Rakov
> > > > > > > > > > > >
> > > > > > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I agree with the total blocking duration metric, but
> > > > > > > > > > > > > I still don't understand why the instant value indicating that operations
> > > > > > > > > > > > > are blocked should be a boolean.
> > > > > > > > > > > > > The duration since blocking started looks more appropriate and useful.
> > > > > > > > > > > > > It gives more information while the semantics stay the same.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, 23 Jul 2019 at 11:42, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > All previous suggestions have some disadvantages. There can be several
> > > > > > > > > > > > > > exchanges between two metric updates, and a fast exchange can overwrite
> > > > > > > > > > > > > > a previous long exchange.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We can introduce a metric of total blocking duration that will
> > > > > > > > > > > > > > accumulate at the end of the exchange. So, users will get actual
> > > > > > > > > > > > > > information about how long operations were blocked. The cluster metric
> > > > > > > > > > > > > > will be the maximum of the local node metrics. And we need a boolean
> > > > > > > > > > > > > > metric that will indicate the real-time status. It is needed because the
> > > > > > > > > > > > > > duration metric is updated only at the end of the exchange.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So I propose to change the current metric, which is not released yet, to
> > > > > > > > > > > > > > the totalCacheOperationsBlockingDuration metric and to add the
> > > > > > > > > > > > > > isCacheOperationsBlocked metric.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, 22 Jul 2019 at 09:27, Anton Vinogradov <av@apache.org> wrote:
> > > > > > > > > > > > > > > Nikolay,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Still see no reason to replace boolean with
> long.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay
> Izhikov <
> > > > > > > > >
> > > > > > > > > nizhikov@apache.org >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. Value exported based on SPI settings, not
> > in the moment it
> > > > > > > > >
> > > > > > > > > changed.
> > > > > > > > > > > > > > > > 2. Clock synchronisation - if we export start
> > time, we should
> > > > > > > > >
> > > > > > > > > also
> > > > > > > > > > > > > > export
> > > > > > > > > > > > > > > > node local timestamp.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > пн, 22 июля 2019 г., 8:33 Anton Vinogradov <
> > av@apache.org >:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > What's the reason for duration counting?
> > > > > > > > > > > > > > > > > AFAIU, it's a monitoring system feature to
> > count the durations.
> > > > > > > > > > > > > > > > > Sine monitoring system checks metrics
> > periodically it will know
> > > > > > > > >
> > > > > > > > > the
> > > > > > > > > > > > > > > > > duration by its own log.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel
> > Kovalenko <
> > > > > > > > >
> > > > > > > > > jokserfn@gmail.com >
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Yes, I mean duration not timestamp. For
> > the metric name, I
> > > > > > > > >
> > > > > > > > > suggest
> > > > > > > > > > > > > > > > > > "cacheOperationsBlockingDuration", I
> think
> > it cleaner
> > > > > > > > >
> > > > > > > > > represents
> > > > > > > > > > > > > > what
> > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > blocked during PME.
> > > > > > > > > > > > > > > > > > We can also combine both timestamp
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > "cacheOperationsBlockingStartTs" and
> > > > > > > > > > > > > > > > > > duration to have better correlation when
> > cache operations were
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > blocked
> > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > how much time it's taken.
> > > > > > > > > > > > > > > > > > For instant view (like in JMX bean) a
> > calculated value as you
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > mentioned
> > > > > > > > > > > > > > > > > > can be used.
> > > > > > > > > > > > > > > > > > For metrics are exported to some backend
> > (IEP-35) a counter
> > > > > > > > >
> > > > > > > > > can be
> > > > > > > > > > > > > > > > used.
> > > > > > > > > > > > > > > > > > The counter is incremented by blocking
> > time after blocking has
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ended.
> > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 19:10, Nikita
> > Amelchev <
> > > > > > > > >
> > > > > > > > > nsamelchev@gmail.com
> > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > Pavel,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > The main purpose of this metric is
> > > > > > > > > > > > > > > > > > > > > how much time we wait for resuming
> > cache operations
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Seems I misunderstood you. Do you mean
> > timestamp or duration
> > > > > > > > >
> > > > > > > > > here?
> > > > > > > > > > > > > > > > > > > > > What do you think if we change the
> > boolean value of metric
> > > > > > > > >
> > > > > > > > > to a
> > > > > > > > > > > > > > > > long
> > > > > > > > > > > > > > > > > > > value that represents time in
> > milliseconds when operations
> > > > > > > > >
> > > > > > > > > were
> > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > This time can be calculated as
> > (currentTime -
> > > > > > > > > > > > > > > > > > > timeSinceOperationsBlocked) in case of
> > timestamp.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Duration will be more understandable.
> > It'll be something like
> > > > > > > > > > > > > > > > > > > getCurrentBlockingPmeDuration. But I
> > haven't come up with a
> > > > > > > > >
> > > > > > > > > better
> > > > > > > > > > > > > > > > > > > name yet.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 18:30, Pavel
> > Kovalenko <
> > > > > > > > >
> > > > > > > > > jokserfn@gmail.com
> > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I think getCurrentPmeDuration doesn't
> > show useful
> > > > > > > > >
> > > > > > > > > information.
> > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > > main
> > > > > > > > > > > > > > > > > > > PME side effect for end-users is
> > blocking cache operations.
> > > > > > > > >
> > > > > > > > > Not
> > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > PME
> > > > > > > > > > > > > > > > > > > time blocks it.
> > > > > > > > > > > > > > > > > > > > What information gives to an end-user
> > timestamp of
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > "timeSinceOperationsBlocked"? For what
> > analysis it can be
> > > > > > > > >
> > > > > > > > > used and
> > > > > > > > > > > > > > > > how?
> > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 17:48, Nikita
> > Amelchev <
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > Hi Pavel,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > This time already can be obtained
> > from the
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > getCurrentPmeDuration
> > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > > new isOperationsBlockedByPme
> metrics.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > As an alternative solution, I can
> > rework recently added
> > > > > > > > > > > > > > > > > > > > > getCurrentPmeDuration metric (not
> > released yet). Seems for
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > users it
> > > > > > > > > > > > > > > > > > > > > useless in case of non-blocking
> PME.
> > > > > > > > > > > > > > > > > > > > > Lets name it
> > timeSinceOperationsBlocked. It'll be timestamp
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > > blocking started (minimal value of
> > cluster nodes) and 0 if
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > ends (there is no running PME).
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 15:56, Pavel
> > Kovalenko <
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   jokserfn@gmail.com >:
> > > > > > > > > > > > > > > > > > > > > > Hi Nikita,
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Thank you for working on this.
> > What do you think if we
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > change the
> > > > > > > > > > > > > > > > > > > boolean
> > > > > > > > > > > > > > > > > > > > > > value of metric to a long value
> > that represents time in
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > milliseconds
> > > > > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > > > operations were blocked?
> > > > > > > > > > > > > > > > > > > > > > Since we have not only JMX and
> now
> > metrics are periodically
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > exported
> > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > > some backend it can give a more
> > clear picture of how much
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > time we
> > > > > > > > > > > > > > > > > > > wait for
> > > > > > > > > > > > > > > > > > > > > > resuming cache operations instead
> > of instant boolean
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > indicator.
> > > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 14:41,
> > Nikita Amelchev <
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > Anton, Nikolay,
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Thanks for the support.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > For now, we have the
> > getCurrentPmeDuration() metric that
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > does
> > > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > > > > > > > > influence on the cluster
> > correctly. PME can be without
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > > > operations. For example, client
> > node join/leave events.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > I suggest add new metric -
> > isOperationsBlockedByPme().
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Together,
> > > > > > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > > > > > > > metrics will show influence of
> > the PME on cluster and user
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > operations.
> > > > > > > > > > > > > > > > > > > > > > > I have prepared PR for this
> (Bot
> > visa is green). [1] Can
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > > > > take a
> > > > > > > > > > > > > > > > > > > > > > > look?
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > [1]
> > https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > вт, 16 июл. 2019 г. в 14:58,
> > Nikolay Izhikov <
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >   nizhikov@apache.org
> > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > I think administator of
> Ignite
> > cluster should be able to
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > monitor
> > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > Ignite process, including non
> > blocking PME.
> > > > > > > > > > > > > > > > > > > > > > > > В Вт, 16/07/2019 в 14:57
> > +0300, Anton Vinogradov пишет:
> > > > > > > > > > > > > > > > > > > > > > > > > BTW,
> > > > > > > > > > > > > > > > > > > > > > > > > Found PME metric -
> > getCurrentPmeDuration().
> > > > > > > > > > > > > > > > > > > > > > > > > Seems, it shows exactly PME
> > time and not so useful
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > because
> > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > > > > > > > > > > > The goal it so show exactly
> > blocking period.
> > > > > > > > > > > > > > > > > > > > > > > > > When PME cause no blocking,
> > it's a good PME and I see
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > no
> > > > > > > > > > > > > > > > > > > reason to have
> > > > > > > > > > > > > > > > > > > > > > > > > monitoring related to it :)
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at
> 2:50
> > PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > Why do we need to
> postpone
> > implementation of this
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > metrics?
> > > > > > > > > > > > > > > > > > > > > > > > > > For now, implementation
> of
> > new metric is very simple.
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > I think we can implement
> > this metrics as a single
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > contribution.
> > > > > > > > > > > > > > > > > > > > > > > > > > В Вт, 16/07/2019 в 13:47
> > +0300, Anton Vinogradov
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > пишет:
> > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > Looks like all we need
> > now is a 1 simple metric:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > > > operations
> > > > > > > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > > > Just a true or false.
> > > > > > > > > > > > > > > > > > > > > > > > > > > Lest start from this.
> > > > > > > > > > > > > > > > > > > > > > > > > > > All other metrics can
> be
> > extracted from logs now
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > > > > > implemented
> > > > > > > > > > > > > > > > > > > > > > > > > > > later.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at
> > 12:46 PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita, please, go
> > ahead.
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > вт, 16 июля 2019 г.,
> > 11:45 Nikita Amelchev <
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hello, Igniters.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest to add
> > some useful metrics about the
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > partition map
> > > > > > > > > > > > > > > > > > > > > > > exchange
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > (PME). For now, the
> > duration of PME stages
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > available
> > > > > > > > > > > > > > > > > > > only in
> > > > > > > > > > > > > > > > > > > > > > > log
> > > > > > > > > > > > > > > > > > > > > > > > > > files
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > and cannot be
> > obtained using JMX or other
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > external
> > > > > > > > > > > > > > > > > > > tools. [1]
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > I made the list of
> > local node metrics that
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > help to
> > > > > > > > > > > > > > > > > > > understand
> > > > > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > actual status of
> > current PME:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. initialVersion.
> > Topology version that
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > initiates
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > exchange.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime. Time
> > PME was started.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent. Event
> > that triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > 4.
> > partitionReleaseTime. Time when a node has
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > finished
> > > > > > > > > > > > > > > > > > > waiting
> > > > > > > > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > updates and
> > translations on a previous
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > topology.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > 5.
> > sendSingleMessageTime. Time when a node
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > sent a
> > > > > > > > > > > > > > > > > > > single
> > > > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > 6.
> > recieveFullMessageTime. Time when a node
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > received
> > > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > > > full
> > > > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > 7. finishTime. Time
> > PME was ended.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > When new PME
> started
> > all these metrics resets.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > These metrics help
> > to understand:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long PME was
> > (current or previous).
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long awaited
> > for all updates was
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > completed.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > - what node blocks
> > PME (didn't send a single
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > message)
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > - what triggered
> PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Zhenya Stanilovsky
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best wishes,
> > > > > > > > > Amelchev Nikita
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best wishes,
> > > > > > > Amelchev Nikita
> > >
> > >
> > >
> >
>
