ignite-dev mailing list archives

From Nikita Amelchev <nsamelc...@gmail.com>
Subject Re: Re[2]: Partition map exchange metrics
Date Wed, 24 Jul 2019 11:02:29 GMT
Nikolay,

The cacheOperationsBlockedDuration metric will show the current blocking
duration, or 0 if operations are not blocked right now.

The totalCacheOperationsBlockedDuration metric will accumulate all
blocking durations that occur after the node starts.
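
A minimal Java sketch of how the two metrics could relate to each other. The class name, method names and the injected clock are hypothetical and are not taken from the Ignite code base. It also assumes that the total is updated only when a blocking period ends, as proposed earlier in this thread.

```java
import java.util.function.LongSupplier;

/** Hypothetical tracker for the two metrics; not Ignite's actual implementation. */
final class BlockedDurationTracker {
    /** Millisecond clock, injected so the behaviour can be checked deterministically. */
    private final LongSupplier clockMillis;

    /** Start time of the current blocking period, or -1 if operations are not blocked. */
    private long blockStart = -1;

    /** Sum of all finished blocking periods since the tracker was created. */
    private long total;

    BlockedDurationTracker(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
    }

    /** Call when cache operations become blocked. Repeated calls keep the first start time. */
    synchronized void onBlockingStarted() {
        if (blockStart < 0)
            blockStart = clockMillis.getAsLong();
    }

    /** Call when cache operations are unblocked; adds the finished period to the total. */
    synchronized void onBlockingFinished() {
        if (blockStart >= 0) {
            total += clockMillis.getAsLong() - blockStart;
            blockStart = -1;
        }
    }

    /** Current blocking duration in milliseconds, or 0 if there is no blocking right now. */
    synchronized long cacheOperationsBlockedDuration() {
        return blockStart < 0 ? 0 : clockMillis.getAsLong() - blockStart;
    }

    /** Accumulated duration of all finished blocking periods, in milliseconds. */
    synchronized long totalCacheOperationsBlockedDuration() {
        return total;
    }
}
```

In a real node the clock would typically be `System::currentTimeMillis`. A monitoring system can then alert on a non-zero current value and graph the growth of the total.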

Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <nizhikov@apache.org> wrote:
>
> Nikita
>
> What is the difference between those two metrics?
>
> Wed, 24 Jul 2019, 12:45 Nikita Amelchev <nsamelchev@gmail.com> wrote:
>
> > Igniters, thanks for the comments.
> >
> > From the discussion it can be seen that we need only two metrics for now:
> > - cacheOperationsBlockedDuration (long)
> > - totalCacheOperationsBlockedDuration (long)
> >
> > I will prepare a PR shortly.
> >
> > Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <arzamas123@mail.ru.invalid> wrote:
> > >
> > > +1 to Anton's decisions.
> > >
> > >
> > > >Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov <av@apache.org>:
> > > >
> > > >Folks,
> > > >
> > > >It looks like we're trying to implement "extended debug" instead of
> > > >"monitoring".
> > > >A real admin should not care which phase of PME is in progress and
> > > >so on.
> > > >The interesting metrics are
> > > >- total blocked time (will be used for real SLA counting)
> > > >- are we blocked right now (shows we have an SLA degradation right now)
> > > >Duration of the current blocking period can be easily derived by any
> > > >modern monitoring tool through regular checks.
> > > >The first true will mean "period start"; precision will depend on the
> > > >check frequency.
> > > >Anyway, I'm ok with presenting the current metric as a long, where the
> > > >long is a duration. I see no reason for it, but ok :)
> > > >
> > > >All other features you mentioned are useful for improving code or
> > > >deployment and can (should) be taken from logs at the analysis
> > > >phase.
> > > >
> > > >On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <ivan.glukos@gmail.com> wrote:
> > > >
> > > >> Folks, let me step in.
> > > >>
> > > >> Nikita, thanks for your suggestions!
> > > >>
> > > >> > 1. initialVersion. Topology version that initiates the exchange.
> > > >> > 2. initTime. Time PME was started.
> > > >> > 3. initEvent. Event that triggered PME.
> > > >> > 4. partitionReleaseTime. Time when a node has finished waiting for all
> > > >> > updates and translations on a previous topology.
> > > >> > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > >> > 6. recieveFullMessageTime. Time when a node received a full message.
> > > >> > 7. finishTime. Time PME was ended.
> > > >> >
> > > >> > When a new PME starts, all these metrics are reset.
> > > >> Every metric from Nikita's list looks useful and simple to implement.
> > > >> I think it would be better to change the format of metrics 4, 5, 6 and
> > > >> 7 a bit: we can keep only the difference between the time of the previous
> > > >> event and the time of the corresponding event. Such metrics would be
> > > >> easier to perceive: they answer specific questions such as "how much time
> > > >> did partition release take?" or "how much time did awaiting the end of
> > > >> the distributed phase take?".
> > > >> Also, if the results of 4, 5, 6 and 7 are exported to a monitoring system,
> > > >> graphs will show how the times of different stages change from one PME to
> > > >> another.
> > > >>
> > > >> > When PME causes no blocking, it's a good PME and I see no reason to have
> > > >> > monitoring related to it
> > > >> Agree with Anton here. These metrics should be measured only for true
> > > >> distributed exchange. Saving results for client leave/join PMEs will
> > > >> just complicate monitoring.
> > > >>
> > > >> > I agree with total blocking duration metric but
> > > >> > I still don't understand why instant value indicating that operations are
> > > >> > blocked should be boolean.
> > > >> > Duration time since blocking has started looks more appropriate and useful.
> > > >> > It gives more information while semantic is left the same.
> > > >> Totally agree with Pavel here. Both "accumulated block time" and
> > > >> "current PME block time" metrics are useful. Growth of the accumulated
> > > >> metric over a specific period of time (should be easy to check via a
> > > >> monitoring system graph) will show for how long business operations were
> > > >> blocked in total, and a non-zero current metric will show that we are
> > > >> experiencing issues right now. A boolean metric "are we blocked right now"
> > > >> is not needed, as it can obviously be inferred from "current PME block
> > > >> time".
> > > >>
> > > >> Best Regards,
> > > >> Ivan Rakov
> > > >>
> > > >> On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > >> > Nikita,
> > > >> >
> > > >> > I agree with total blocking duration metric but
> > > >> > I still don't understand why instant value indicating that operations are
> > > >> > blocked should be boolean.
> > > >> > Duration time since blocking has started looks more appropriate and useful.
> > > >> > It gives more information while semantic is left the same.
> > > >> >
> > > >> >
> > > >> >
> > > >> > Tue, 23 Jul 2019 at 11:42, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > >> >
> > > >> >> Folks,
> > > >> >>
> > > >> >> All previous suggestions have some disadvantages. There can be several
> > > >> >> exchanges between two metric updates, and a fast exchange can overwrite
> > > >> >> a previous long exchange.
> > > >> >>
> > > >> >> We can introduce a metric of total blocking duration that will
> > > >> >> accumulate at the end of the exchange. So, users will get actual
> > > >> >> information about how long operations were blocked. The cluster metric
> > > >> >> will be the maximum of the local node metrics. We also need a boolean
> > > >> >> metric that will indicate the realtime status. It is needed because the
> > > >> >> duration metric is updated only at the end of the exchange.
> > > >> >>
> > > >> >> So I propose to change the current, not yet released metric to the
> > > >> >> totalCacheOperationsBlockingDuration metric and to add the
> > > >> >> isCacheOperationsBlocked metric.
> > > >> >>
> > > >> >> WDYT?
> > > >> >>
> > > >> >> Mon, 22 Jul 2019 at 09:27, Anton Vinogradov <av@apache.org> wrote:
> > > >> >>> Nikolay,
> > > >> >>>
> > > >> >>> Still see no reason to replace boolean with long.
> > > >> >>>
> > > >> >>> On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > >> >>>> Anton.
> > > >> >>>>
> > > >> >>>> 1. The value is exported based on SPI settings, not at the moment it
> > > >> >>>> changes.
> > > >> >>>>
> > > >> >>>> 2. Clock synchronisation - if we export the start time, we should also
> > > >> >>>> export the node-local timestamp.
> > > >> >>>>
> > > >> >>>> Mon, 22 Jul 2019, 8:33 Anton Vinogradov <av@apache.org> wrote:
> > > >> >>>>
> > > >> >>>>> Folks,
> > > >> >>>>>
> > > >> >>>>> What's the reason for duration counting?
> > > >> >>>>> AFAIU, it's a monitoring system feature to count the durations.
> > > >> >>>>> Since the monitoring system checks metrics periodically, it will know
> > > >> >>>>> the duration from its own log.
> > > >> >>>>>
> > > >> >>>>> On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <jokserfn@gmail.com> wrote:
> > > >> >>>>>
> > > >> >>>>>> Nikita,
> > > >> >>>>>>
> > > >> >>>>>> Yes, I mean duration, not timestamp. For the metric name, I suggest
> > > >> >>>>>> "cacheOperationsBlockingDuration"; I think it more clearly represents
> > > >> >>>>>> what is blocked during PME.
> > > >> >>>>>> We can also combine both the timestamp "cacheOperationsBlockingStartTs"
> > > >> >>>>>> and the duration to better correlate when cache operations were blocked
> > > >> >>>>>> and how much time it took.
> > > >> >>>>>> For an instant view (like in a JMX bean), a calculated value as you
> > > >> >>>>>> mentioned can be used.
> > > >> >>>>>> For metrics exported to some backend (IEP-35), a counter can be used.
> > > >> >>>>>> The counter is incremented by the blocking time after blocking has ended.
> > > >> >>>>>> Fri, 19 Jul 2019 at 19:10, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > >> >>>>>>> Pavel,
> > > >> >>>>>>>
> > > >> >>>>>>> The main purpose of this metric is
> > > >> >>>>>>>>> how much time we wait for resuming cache operations
> > > >> >>>>>>> Seems I misunderstood you. Do you mean timestamp or duration here?
> > > >> >>>>>>>>> What do you think if we change the boolean value of metric to a long
> > > >> >>>>>>>>> value that represents time in milliseconds when operations were blocked?
> > > >> >>>>>>> This time can be calculated as (currentTime - timeSinceOperationsBlocked)
> > > >> >>>>>>> in case of timestamp.
> > > >> >>>>>>>
> > > >> >>>>>>> Duration will be more understandable. It'll be something like
> > > >> >>>>>>> getCurrentBlockingPmeDuration. But I haven't come up with a better
> > > >> >>>>>>> name yet.
> > > >> >>>>>>>
> > > >> >>>>>>> Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko <jokserfn@gmail.com> wrote:
> > > >> >>>>>>>> Nikita,
> > > >> >>>>>>>>
> > > >> >>>>>>>> I think getCurrentPmeDuration doesn't show useful information. The
> > > >> >>>>>>>> main PME side effect for end-users is blocking cache operations. Not
> > > >> >>>>>>>> all PME time blocks them.
> > > >> >>>>>>>> What information does the "timeSinceOperationsBlocked" timestamp give
> > > >> >>>>>>>> to an end-user? For what analysis can it be used, and how?
> > > >> >>>>>>>> Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > >> >>>>>>>>> Hi Pavel,
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> This time can already be obtained from the getCurrentPmeDuration and
> > > >> >>>>>>>>> new isOperationsBlockedByPme metrics.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> As an alternative solution, I can rework the recently added
> > > >> >>>>>>>>> getCurrentPmeDuration metric (not released yet). It seems useless for
> > > >> >>>>>>>>> users in case of non-blocking PME.
> > > >> >>>>>>>>> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
> > > >> >>>>>>>>> blocking started (minimal value across cluster nodes) and 0 when
> > > >> >>>>>>>>> blocking ends (there is no running PME).
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> WDYT?
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <jokserfn@gmail.com> wrote:
> > > >> >>>>>>>>>> Hi Nikita,
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Thank you for working on this. What do you think if we change the
> > > >> >>>>>>>>>> boolean value of metric to a long value that represents time in
> > > >> >>>>>>>>>> milliseconds when operations were blocked?
> > > >> >>>>>>>>>> Since we have not only JMX and metrics are now periodically exported
> > > >> >>>>>>>>>> to some backend, it can give a clearer picture of how much time we
> > > >> >>>>>>>>>> wait for resuming cache operations than an instant boolean indicator.
> > > >> >>>>>>>>>> Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > >> >>>>>>>>>>> Anton, Nikolay,
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> Thanks for the support.
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> For now, we have the getCurrentPmeDuration() metric that does not
> > > >> >>>>>>>>>>> show the influence on the cluster correctly. PME can happen without
> > > >> >>>>>>>>>>> blocking operations, for example on client node join/leave events.
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> I suggest adding a new metric - isOperationsBlockedByPme(). Together,
> > > >> >>>>>>>>>>> these metrics will show the influence of the PME on the cluster and
> > > >> >>>>>>>>>>> user operations.
> > > >> >>>>>>>>>>> I have prepared a PR for this (Bot visa is green). [1] Can anyone
> > > >> >>>>>>>>>>> take a look?
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-11961
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > >> >>>>>>>>>>>> I think the administrator of an Ignite cluster should be able to
> > > >> >>>>>>>>>>>> monitor all Ignite processes, including non-blocking PME.
> > > >> >>>>>>>>>>>> On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > > >> >>>>>>>>>>>>> BTW,
> > > >> >>>>>>>>>>>>> Found PME metric - getCurrentPmeDuration().
> > > >> >>>>>>>>>>>>> Seems it shows exactly the PME time and is not so useful because of
> > > >> >>>>>>>>>>>>> this.
> > > >> >>>>>>>>>>>>> The goal is to show exactly the blocking period.
> > > >> >>>>>>>>>>>>> When PME causes no blocking, it's a good PME and I see no reason to
> > > >> >>>>>>>>>>>>> have monitoring related to it :)
> > > >> >>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>> On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > >> >>>>>>>>>>>>>> Anton.
> > > >> >>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>> Why do we need to postpone the implementation of these metrics?
> > > >> >>>>>>>>>>>>>> For now, implementing a new metric is very simple.
> > > >> >>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>> I think we can implement these metrics as a single contribution.
> > > >> >>>>>>>>>>>>>> On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > > >> >>>>>>>>>>>>>>> Nikita,
> > > >> >>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>> Looks like all we need now is one simple metric: are operations
> > > >> >>>>>>>>>>>>>>> blocked?
> > > >> >>>>>>>>>>>>>>> Just a true or false.
> > > >> >>>>>>>>>>>>>>> Let's start from this.
> > > >> >>>>>>>>>>>>>>> All other metrics can be extracted from logs now and can be
> > > >> >>>>>>>>>>>>>>> implemented later.
> > > >> >>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > >> >>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>> +1.
> > > >> >>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>> Nikita, please, go ahead.
> > > >> >>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>> Tue, 16 Jul 2019, 11:45 Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > >> >>>>>>>>>>>>>>>>> Hello, Igniters.
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> I suggest adding some useful metrics about the partition map exchange
> > > >> >>>>>>>>>>>>>>>>> (PME). For now, the duration of PME stages is available only in log
> > > >> >>>>>>>>>>>>>>>>> files and cannot be obtained using JMX or other external tools. [1]
> > > >> >>>>>>>>>>>>>>>>> I made a list of local node metrics that help to understand the
> > > >> >>>>>>>>>>>>>>>>> actual status of the current PME:
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> 1. initialVersion. Topology version that initiates the exchange.
> > > >> >>>>>>>>>>>>>>>>> 2. initTime. Time PME was started.
> > > >> >>>>>>>>>>>>>>>>> 3. initEvent. Event that triggered PME.
> > > >> >>>>>>>>>>>>>>>>> 4. partitionReleaseTime. Time when a node has finished waiting for all
> > > >> >>>>>>>>>>>>>>>>> updates and translations on a previous topology.
> > > >> >>>>>>>>>>>>>>>>> 5. sendSingleMessageTime. Time when a node sent a single message.
> > > >> >>>>>>>>>>>>>>>>> 6. recieveFullMessageTime. Time when a node received a full message.
> > > >> >>>>>>>>>>>>>>>>> 7. finishTime. Time PME was ended.
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> When a new PME starts, all these metrics are reset.
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> These metrics help to understand:
> > > >> >>>>>>>>>>>>>>>>> - how long PME was (current or previous).
> > > >> >>>>>>>>>>>>>>>>> - how long the wait for all updates to complete was.
> > > >> >>>>>>>>>>>>>>>>> - what node blocks PME (didn't send a single message)
> > > >> >>>>>>>>>>>>>>>>> - what triggered PME.
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> Thoughts?
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> --
> > > >> >>>>>>>>>>>>>>>>> Best wishes,
> > > >> >>>>>>>>>>>>>>>>> Amelchev Nikita
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> --
> > > >> >>>>>>>>>>> Best wishes,
> > > >> >>>>>>>>>>> Amelchev Nikita
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> --
> > > >> >>>>>>>>> Best wishes,
> > > >> >>>>>>>>> Amelchev Nikita
> > > >> >>>>>>>
> > > >> >>>>>>>
> > > >> >>>>>>> --
> > > >> >>>>>>> Best wishes,
> > > >> >>>>>>> Amelchev Nikita
> > > >> >>>>>>>
> > > >> >>
> > > >> >>
> > > >> >> --
> > > >> >> Best wishes,
> > > >> >> Amelchev Nikita
> > > >> >>
> > > >>
> > >
> > >
> > > --
> > > Zhenya Stanilovsky
> >
> >
> >
> > --
> > Best wishes,
> > Amelchev Nikita
> >



-- 
Best wishes,
Amelchev Nikita
