ignite-dev mailing list archives

From Nikita Amelchev <nsamelc...@gmail.com>
Subject Re: Partition map exchange metrics
Date Fri, 19 Jul 2019 16:10:26 GMT
Pavel,

The main purpose of this metric is
>> how much time we wait for resuming cache operations

It seems I misunderstood you. Do you mean a timestamp or a duration here?
>> What do you think if we change the boolean value of metric to a long value that represents
>> time in milliseconds when operations were blocked?

If it is a timestamp, this time can be calculated as (currentTime -
timeSinceOperationsBlocked).

A duration will be easier to understand. It'll be something like
getCurrentBlockingPmeDuration, but I haven't come up with a better
name yet.
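
To make the two options concrete, here is a minimal Java sketch of how a timestamp and a duration could be derived from the same state. It is an illustration under assumptions, not the code from the PR: the class BlockingDurationSketch, the callbacks onOperationsBlocked/onOperationsUnblocked, and the injected clock are hypothetical names. Only timeSinceOperationsBlocked, the (currentTime - timeSinceOperationsBlocked) formula, and the "0 when nothing is blocked" convention come from this thread.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical holder for the blocking state; not part of the Ignite API. */
final class BlockingDurationSketch {
    /** Timestamp (ms) when blocking started, or 0 if operations are not blocked. */
    private final AtomicLong blockedSince = new AtomicLong(0L);

    /** Source of the current time in ms, e.g. System::currentTimeMillis. */
    private final LongSupplier clock;

    BlockingDurationSketch(LongSupplier clock) {
        this.clock = clock;
    }

    /** Assumed callback: records the start time only if blocking is not already in progress. */
    void onOperationsBlocked() {
        blockedSince.compareAndSet(0L, clock.getAsLong());
    }

    /** Assumed callback: resets the state when blocking ends. */
    void onOperationsUnblocked() {
        blockedSince.set(0L);
    }

    /** Timestamp variant: the consumer has to subtract it from the current time. */
    long timeSinceOperationsBlocked() {
        return blockedSince.get();
    }

    /** Duration variant: currentTime - timeSinceOperationsBlocked, or 0 if not blocked. */
    long currentBlockingDuration() {
        long since = blockedSince.get();
        return since == 0L ? 0L : clock.getAsLong() - since;
    }
}
```

With the duration variant the subtraction happens on the node, so an exporter that periodically polls the value does not need to know the node's clock.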

On Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko <jokserfn@gmail.com> wrote:
>
> Nikita,
>
> I think getCurrentPmeDuration doesn't show useful information. The main PME side effect
> for end users is blocking cache operations. Not all PME time blocks them.
> What information does the "timeSinceOperationsBlocked" timestamp give to an end user?
> What analysis can it be used for, and how?
>
> On Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <nsamelchev@gmail.com> wrote:
>>
>> Hi Pavel,
>>
>> This time already can be obtained from the getCurrentPmeDuration and
>> new isOperationsBlockedByPme metrics.
>>
>> As an alternative solution, I can rework the recently added
>> getCurrentPmeDuration metric (not released yet). It seems useless
>> for users in the case of a non-blocking PME.
>> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
>> blocking started (the minimal value across cluster nodes) and 0 if
>> blocking has ended (there is no running PME).
>>
>> WDYT?
>>
>> On Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <jokserfn@gmail.com> wrote:
>> >
>> > Hi Nikita,
>> >
>> > Thank you for working on this. What do you think if we change the boolean
>> > value of metric to a long value that represents time in milliseconds when
>> > operations were blocked?
>> > Since we no longer have only JMX, and metrics are now periodically exported to
>> > some backend, it can give a clearer picture of how much time we wait for
>> > resuming cache operations than an instant boolean indicator.
>> >
>> > On Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <nsamelchev@gmail.com> wrote:
>> >
>> > > Anton, Nikolay,
>> > >
>> > > Thanks for the support.
>> > >
>> > > For now, we have the getCurrentPmeDuration() metric, which does not show
>> > > the influence on the cluster correctly. A PME can run without blocking
>> > > operations, for example on client node join/leave events.
>> > >
>> > > I suggest adding a new metric - isOperationsBlockedByPme(). Together, these
>> > > metrics will show the influence of the PME on cluster and user operations.
>> > >
>> > > I have prepared a PR for this (Bot visa is green). [1] Can anyone take a
>> > > look?
>> > >
>> > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
>> > >
>> > > On Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <nizhikov@apache.org> wrote:
>> > >
>> > > >
>> > > > I think an administrator of an Ignite cluster should be able to monitor all
>> > > > Ignite processes, including non-blocking PME.
>> > > >
>> > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
>> > > > > BTW,
>> > > > > Found PME metric - getCurrentPmeDuration().
>> > > > > Seems, it shows exactly PME time and is not so useful because of this.
>> > > > > The goal is to show exactly the blocking period.
>> > > > > When PME causes no blocking, it's a good PME and I see no reason to have
>> > > > > monitoring related to it :)
>> > > > >
>> > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <nizhikov@apache.org> wrote:
>> > > > >
>> > > > > > Anton.
>> > > > > >
>> > > > > > Why do we need to postpone implementation of these metrics?
>> > > > > > For now, implementation of a new metric is very simple.
>> > > > > >
>> > > > > > I think we can implement these metrics as a single contribution.
>> > > > > >
>> > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
>> > > > > > > Nikita,
>> > > > > > >
>> > > > > > > Looks like all we need now is one simple metric: are operations blocked?
>> > > > > > > Just a true or false.
>> > > > > > > Let's start with this.
>> > > > > > > All other metrics can be extracted from logs now and can be implemented
>> > > > > > > later.
>> > > > > > >
>> > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhikov@apache.org> wrote:
>> > > > > > >
>> > > > > > > > +1.
>> > > > > > > >
>> > > > > > > > Nikita, please, go ahead.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Tue, 16 Jul 2019 at 11:45, Nikita Amelchev <nsamelchev@gmail.com> wrote:
>> > > > > > > >
>> > > > > > > > > Hello, Igniters.
>> > > > > > > > >
>> > > > > > > > > I suggest adding some useful metrics about the partition map exchange
>> > > > > > > > > (PME). For now, the durations of PME stages are available only in log
>> > > > > > > > > files and cannot be obtained using JMX or other external tools. [1]
>> > > > > > > > >
>> > > > > > > > > I made a list of local node metrics that help to understand the
>> > > > > > > > > actual status of the current PME:
>> > > > > > > > >
>> > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
>> > > > > > > > > 2. initTime. Time PME was started.
>> > > > > > > > > 3. initEvent. Event that triggered PME.
>> > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for all
>> > > > > > > > > updates and transactions on a previous topology.
>> > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
>> > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
>> > > > > > > > > 7. finishTime. Time PME was ended.
>> > > > > > > > >
>> > > > > > > > > When a new PME starts, all these metrics are reset.
>> > > > > > > > >
>> > > > > > > > > These metrics help to understand:
>> > > > > > > > > - how long the PME was (current or previous).
>> > > > > > > > > - how long the wait for all updates to complete was.
>> > > > > > > > > - what node blocks PME (didn't send a single message)
>> > > > > > > > > - what triggered PME.
>> > > > > > > > >
>> > > > > > > > > Thoughts?
>> > > > > > > > >
>> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
>> > > > > > > > >
>> > > > > > > > > --
>> > > > > > > > > Best wishes,
>> > > > > > > > > Amelchev Nikita
>> > > > > > > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Best wishes,
>> > > Amelchev Nikita
>> > >
>>
>>
>>
>> --
>> Best wishes,
>> Amelchev Nikita



-- 
Best wishes,
Amelchev Nikita
