ignite-dev mailing list archives

From Nikolay Izhikov <nizhi...@apache.org>
Subject Re: Partition map exchange metrics
Date Mon, 22 Jul 2019 06:19:38 GMT
Anton,

1. The value is exported according to the SPI settings, not at the moment it changes.

2. Clock synchronisation: if we export the start time, we should also export
the node-local timestamp.
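
To illustrate point 2, here is a minimal sketch (not Ignite code) of exporting the blocking start time together with the node-local timestamp taken at export time. The names BlockingStartExport, Snapshot, onBlockingStarted and so on are hypothetical, and the injected clock is an assumption made only to keep the example testable. Because both values come from the same node clock, the receiver can compute the blocked time without relying on its own clock being synchronised with the node's.

```java
import java.util.function.LongSupplier;

/** Hypothetical holder for the blocking start time; not part of Ignite's API. */
final class BlockingStartExport {
    /** Values exported together, both read from the same node-local clock. */
    static final class Snapshot {
        /** Time blocking started, or 0 if operations are not blocked. */
        final long blockingStartTs;

        /** Node-local time at the moment of export. */
        final long nodeLocalTs;

        Snapshot(long blockingStartTs, long nodeLocalTs) {
            this.blockingStartTs = blockingStartTs;
            this.nodeLocalTs = nodeLocalTs;
        }

        /** Blocked time derived only from node-local values. */
        long blockedForMillis() {
            return blockingStartTs == 0 ? 0 : Math.max(0, nodeLocalTs - blockingStartTs);
        }
    }

    /** Node-local clock, e.g. System::currentTimeMillis (assumed never to return 0). */
    private final LongSupplier clock;

    /** 0 means "not blocked". */
    private volatile long blockingStartTs;

    BlockingStartExport(LongSupplier clock) {
        this.clock = clock;
    }

    void onBlockingStarted() {
        blockingStartTs = clock.getAsLong();
    }

    void onBlockingFinished() {
        blockingStartTs = 0;
    }

    /** Called by the exporter on its own schedule. */
    Snapshot export() {
        return new Snapshot(blockingStartTs, clock.getAsLong());
    }
}
```

With this shape the exporter can run at whatever period the SPI is configured with; each exported sample still carries enough information to tell how long operations had been blocked on that node at export time.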

On Mon, 22 Jul 2019 at 8:33, Anton Vinogradov <av@apache.org> wrote:

> Folks,
>
> What's the reason for duration counting?
> AFAIU, it's a monitoring system feature to count the durations.
> Since the monitoring system checks metrics periodically, it will know the
> duration from its own log.
>
> On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <jokserfn@gmail.com>
> wrote:
>
> > Nikita,
> >
> > Yes, I mean duration, not timestamp. For the metric name, I suggest
> > "cacheOperationsBlockingDuration"; I think it represents more clearly what is
> > blocked during PME.
> > We can also combine both the timestamp "cacheOperationsBlockingStartTs" and the
> > duration to better correlate when cache operations were blocked and
> > how much time it took.
> > For an instant view (as in a JMX bean), a calculated value, as you mentioned,
> > can be used.
> > For metrics that are exported to some backend (IEP-35), a counter can be used.
> > The counter is incremented by the blocking time after blocking has ended.
> >
> > On Fri, 19 Jul 2019 at 19:10, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> >
> >> Pavel,
> >>
> >> The main purpose of this metric is
> >> >> how much time we wait for resuming cache operations
> >>
> >> Seems I misunderstood you. Do you mean timestamp or duration here?
> >> >> What do you think if we change the boolean value of metric to a long
> >> >> value that represents time in milliseconds when operations were blocked?
> >>
> >> This time can be calculated as (currentTime -
> >> timeSinceOperationsBlocked) in case of timestamp.
> >>
> >> Duration will be more understandable. It'll be something like
> >> getCurrentBlockingPmeDuration. But I haven't come up with a better
> >> name yet.
> >>
> >> On Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko <jokserfn@gmail.com> wrote:
> >> >
> >> > Nikita,
> >> >
> >> > I think getCurrentPmeDuration doesn't show useful information. The main
> >> > PME side effect for end users is blocked cache operations. Not all PME
> >> > time blocks them.
> >> > What information does the "timeSinceOperationsBlocked" timestamp give an
> >> > end user? What analysis can it be used for, and how?
> >> >
> >> > On Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> >> >>
> >> >> Hi Pavel,
> >> >>
> >> >> This time already can be obtained from the getCurrentPmeDuration and
> >> >> new isOperationsBlockedByPme metrics.
> >> >>
> >> >> As an alternative solution, I can rework the recently added
> >> >> getCurrentPmeDuration metric (not released yet). It seems to be
> >> >> useless for users in the case of non-blocking PME.
> >> >> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
> >> >> blocking started (the minimal value across cluster nodes) and 0 when
> >> >> blocking has ended (there is no running PME).
> >> >>
> >> >> WDYT?
> >> >>
> >> >> On Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <jokserfn@gmail.com> wrote:
> >> >> >
> >> >> > Hi Nikita,
> >> >> >
> >> >> > Thank you for working on this. What do you think if we change the boolean
> >> >> > value of the metric to a long value that represents the time in milliseconds
> >> >> > when operations were blocked?
> >> >> > Since we have not only JMX and metrics are now periodically exported to
> >> >> > some backend, it can give a clearer picture of how much time we wait for
> >> >> > cache operations to resume than an instant boolean indicator.
> >> >> >
> >> >> > On Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> >> >> >
> >> >> > > Anton, Nikolay,
> >> >> > >
> >> >> > > Thanks for the support.
> >> >> > >
> >> >> > > For now, we have the getCurrentPmeDuration() metric, which does not show
> >> >> > > the influence on the cluster correctly. PME can happen without blocking
> >> >> > > operations, for example, on client node join/leave events.
> >> >> > >
> >> >> > > I suggest adding a new metric, isOperationsBlockedByPme(). Together, these
> >> >> > > metrics will show the influence of PME on the cluster and user operations.
> >> >> > >
> >> >> > > I have prepared a PR for this (the Bot visa is green). [1] Can anyone take a
> >> >> > > look?
> >> >> > >
> >> >> > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> >> >> > >
> >> >> > > On Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <nizhikov@apache.org> wrote:
> >> >> > >
> >> >> > > >
> >> >> > > > I think the administrator of an Ignite cluster should be able to monitor
> >> >> > > > all Ignite processes, including non-blocking PME.
> >> >> > > >
> >> >> > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> >> >> > > > > BTW,
> >> >> > > > > I found a PME metric: getCurrentPmeDuration().
> >> >> > > > > It seems to show the whole PME time and is not so useful because of this.
> >> >> > > > > The goal is to show exactly the blocking period.
> >> >> > > > > When PME causes no blocking, it's a good PME, and I see no reason to have
> >> >> > > > > monitoring related to it :)
> >> >> > > > >
> >> >> > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <nizhikov@apache.org> wrote:
> >> >> > > > >
> >> >> > > > > > Anton.
> >> >> > > > > >
> >> >> > > > > > Why do we need to postpone the implementation of these metrics?
> >> >> > > > > > For now, implementing a new metric is very simple.
> >> >> > > > > >
> >> >> > > > > > I think we can implement these metrics as a single contribution.
> >> >> > > > > >
> >> >> > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> >> >> > > > > > > Nikita,
> >> >> > > > > > >
> >> >> > > > > > > Looks like all we need now is one simple metric: are operations blocked?
> >> >> > > > > > > Just a true or false.
> >> >> > > > > > > Let's start with this.
> >> >> > > > > > > All other metrics can be extracted from logs now and can be implemented
> >> >> > > > > > > later.
> >> >> > > > > > >
> >> >> > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhikov@apache.org> wrote:
> >> >> > > > > > >
> >> >> > > > > > > > +1.
> >> >> > > > > > > >
> >> >> > > > > > > > Nikita, please, go ahead.
> >> >> > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > On Tue, 16 Jul 2019 at 11:45, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> >> >> > > > > > > >
> >> >> > > > > > > > > Hello, Igniters.
> >> >> > > > > > > > >
> >> >> > > > > > > > > I suggest adding some useful metrics about the partition map exchange
> >> >> > > > > > > > > (PME). For now, the durations of PME stages are available only in log files
> >> >> > > > > > > > > and cannot be obtained using JMX or other external tools. [1]
> >> >> > > > > > > > >
> >> >> > > > > > > > > I made a list of local node metrics that help to understand the
> >> >> > > > > > > > > actual status of the current PME:
> >> >> > > > > > > > >
> >> >> > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> >> >> > > > > > > > > 2. initTime. Time PME was started.
> >> >> > > > > > > > > 3. initEvent. Event that triggered PME.
> >> >> > > > > > > > > 4. partitionReleaseTime. Time when a node finished waiting for all
> >> >> > > > > > > > > updates and transactions on the previous topology.
> >> >> > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> >> >> > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> >> >> > > > > > > > > 7. finishTime. Time PME ended.
> >> >> > > > > > > > >
> >> >> > > > > > > > > When a new PME starts, all these metrics are reset.
> >> >> > > > > > > > >
> >> >> > > > > > > > > These metrics help to understand:
> >> >> > > > > > > > > - how long PME took (current or previous).
> >> >> > > > > > > > > - how long the wait for all updates to complete took.
> >> >> > > > > > > > > - which node blocks PME (didn't send a single message).
> >> >> > > > > > > > > - what triggered PME.
> >> >> > > > > > > > >
> >> >> > > > > > > > > Thoughts?
> >> >> > > > > > > > >
> >> >> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> >> >> > > > > > > > >
> >> >> > > > > > > > > --
> >> >> > > > > > > > > Best wishes,
> >> >> > > > > > > > > Amelchev Nikita
> >> >> > > > > > > > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > --
> >> >> > > Best wishes,
> >> >> > > Amelchev Nikita
> >> >> > >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Best wishes,
> >> >> Amelchev Nikita
> >>
> >>
> >>
> >> --
> >> Best wishes,
> >> Amelchev Nikita
> >>
> >
>
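
For reference, the two views Pavel describes in the quoted thread (a calculated instant value for something like a JMX bean, and a counter incremented by the blocking time once blocking has ended, for exported metrics) could be sketched roughly as follows. This is only an illustration under assumptions: BlockingDurationMetric and its methods are hypothetical names, not the IGNITE-11961 implementation, and the sketch assumes that a single thread reports the start and end of blocking while any thread may read the values.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical metric holder; names are illustrative, not Ignite's API. */
final class BlockingDurationMetric {
    /** Node-local clock, e.g. System::currentTimeMillis (assumed never to return 0). */
    private final LongSupplier clock;

    /** Counter for exported metrics: grows only after a blocking period has ended. */
    private final AtomicLong totalBlockedMillis = new AtomicLong();

    /** Start of the current blocking period, 0 when operations are not blocked. */
    private volatile long blockingStartTs;

    BlockingDurationMetric(LongSupplier clock) {
        this.clock = clock;
    }

    /** Assumed to be called by the single thread that drives the exchange. */
    void onBlockingStarted() {
        blockingStartTs = clock.getAsLong();
    }

    /** Adds the finished period to the counter and clears the start time. */
    void onBlockingFinished() {
        long start = blockingStartTs;

        if (start != 0) {
            totalBlockedMillis.addAndGet(Math.max(0, clock.getAsLong() - start));
            blockingStartTs = 0;
        }
    }

    /** Instant view: currentTime - blockingStart, or 0 when not blocked. */
    long currentBlockingDuration() {
        long start = blockingStartTs;

        return start == 0 ? 0 : Math.max(0, clock.getAsLong() - start);
    }

    /** Counter view: total time of all completed blocking periods. */
    long totalBlockingDuration() {
        return totalBlockedMillis.get();
    }
}
```

Since the counter only ever grows, a backend that samples it periodically can take the difference between two samples to see how much blocking time completed in between, even if a short blocking period fell entirely between exports.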
