ignite-dev mailing list archives

From Andrey Gura <ag...@apache.org>
Subject Re: Re[2]: Cache operations performance metrics
Date Fri, 20 Dec 2019 14:21:07 GMT
> but between having something and having nothing, I choose something

We already have "something": the put, get, etc. metrics. As I said earlier,
they are relatively useless. And the same metrics with histograms don't
add any value.

> I found one grid machine with very different IO usage than the others; «dig deeper» highlighted
> a cache whose put operations differed greatly from those on other nodes, and a final «dig deeper»
> helped to find a code bug

I believe the same could be noticed using PK index stats.

> if the new one would be more useful, why not?

If some particular value is relatively useless, then a histogram of the
same value will still be relatively useless :) That's my point. Stop adding
dozens of metrics and start thinking about their benefits and meaning.
Discuss it with the community.
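
For illustration, a minimal sketch of what measuring a business operation in user context (point 2 in the quoted messages below) could look like. The class `BusinessOpHistogram` and its methods are hypothetical names introduced here, not part of the Ignite API; the sketch uses only the JDK and assumes millisecond bucket bounds chosen by the application.

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.function.Supplier;

// Hypothetical helper, not part of the Ignite API: a fixed-bucket
// latency histogram owned by the application for ONE business operation.
final class BusinessOpHistogram {
    private final long[] boundsMillis;     // inclusive upper bounds, ascending
    private final AtomicLongArray counts;  // one extra bucket for overflow

    BusinessOpHistogram(long... boundsMillis) {
        this.boundsMillis = boundsMillis.clone();
        this.counts = new AtomicLongArray(boundsMillis.length + 1);
    }

    // Puts a duration into the first bucket whose bound is >= the duration.
    void record(long durationMillis) {
        int i = 0;
        while (i < boundsMillis.length && durationMillis > boundsMillis[i])
            i++;
        counts.incrementAndGet(i);
    }

    // Times the whole business operation, whatever cache calls it makes inside.
    <T> T measure(Supplier<T> businessOp) {
        long start = System.nanoTime();
        try {
            return businessOp.get();
        }
        finally {
            record((System.nanoTime() - start) / 1_000_000L);
        }
    }

    // Copies the current bucket counters.
    long[] snapshot() {
        long[] res = new long[counts.length()];
        for (int i = 0; i < res.length; i++)
            res[i] = counts.get(i);
        return res;
    }
}
```

The idea is that the application wraps a complete operation it understands (for example, a hypothetical `transferMoney` call) in `measure(...)`, so a change in the histogram maps to a known business action rather than to an individual cache `put` or `get`.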


On Fri, Dec 20, 2019 at 4:59 PM Zhenya Stanilovsky
<arzamas123@mail.ru.invalid> wrote:
>
>
> >> Did it become slower or faster?
> >
> >A very good question! The user saw that the "cache put time" metric became 2x
> >bigger. Did it become slower or faster? Or do we just put values into the
> >cache that are 4x bigger in size? Or did we always put values
> >locally before, and now we put values on remote nodes? Or do our operations
> >execute in a transaction, so the time depends on the transaction type,
> >the actions in the transaction, and other transactions, and actually tells us
> >nothing about the real cache operation? We have more questions than answers.
>
> Andrey, I hope I understand your point of view here, but between having something and
> having nothing, I choose something; it is sometimes really helpful. A real-life case: I found
> one grid machine with very different IO usage than the others; «dig deeper» highlighted a cache
> whose put operations differed greatly from those on other nodes, and a final «dig deeper» helped
> to find a code bug. To be clear, the old mechanism works OK for me here, but if the new one
> would be more useful, why not?
>
> >> On the other hand, if `PutTime` increased, then we know for sure that all operations executing `put` became slower.
> >
> >Of course not :) See above.
> >
> >On Fri, Dec 20, 2019 at 3:20 PM Николай Ижиков <nizhikov@apache.org> wrote:
> >>
> >> > It also will be visible on other metrics
> >>
> >> How will it be visible?
> >>
> >> For example, the user saw that the «checkpoint time» metric became 2x bigger.
> >> How does it relate to business operations? Did it become slower or faster?
> >> What does it mean for application performance?
> >>
> >> On the other hand, if `PutTime` increased, then we know for sure that all operations executing `put` became slower.
> >>
> >> *Why* it became slower is the essence of the «go deeper» investigation.
> >>
> >> > On Dec 20, 2019, at 15:07, Andrey Gura <agura@apache.org> wrote:
> >> >
> >> >> If a cache has some percentage of relatively slow transactions, this is a trigger for a deeper investigation.
> >> >
> >> > It also will be visible on other metrics. So cache operation metrics
> >> > are still useless because they are transitive values.
> >> >
> >> >>> 1. Measure some important internals (WAL operations, checkpoint time, etc.) because they can point to real problems.
> >> >
> >> >> We already implement it.
> >> >
> >> > I am not saying that it isn't implemented. It is just an example of things
> >> > that should be measured. All other metrics depend on internals.
> >> >
> >> >>> 2. Measure business operations in user context, not cache API operations.
> >> >
> >> >> Why do you think these approaches should exclude one another?
> >> >
> >> > Because one of them is useless.
> >> >
> >> > On Fri, Dec 20, 2019 at 1:43 PM Николай Ижиков <nizhikov@apache.org> wrote:
> >> >>
> >> >> Hello, Andrey.
> >> >>
> >> >>> Where is the sense in this value? I explained why these metrics are relatively useless.
> >> >>
> >> >> I don’t agree with you.
> >> >> I believe they are not useless for a user.
> >> >> And I will try to explain why I think so.
> >> >>
> >> >>> But the user can't distinguish one transaction from another, so his knowledge definitely doesn't make sense.
> >> >>
> >> >> Users shouldn’t have to distinguish.
> >> >> If a cache has some percentage of relatively slow transactions, this is a trigger for a deeper investigation.
> >> >>
> >> >>> 1. Measure some important internals (WAL operations, checkpoint time, etc.) because they can point to real problems.
> >> >>
> >> >> We already implement it.
> >> >> What metrics are missing for internal processes?
> >> >>
> >> >>> 2. Measure business operations in user context, not cache API operations.
> >> >>
> >> >> Why do you think these approaches should exclude one another?
> >> >> Users definitely should measure whole business transaction performance.
> >> >>
> >> >> I think we should provide a way to measure the part of the business transaction that relates to Ignite.
> >> >>
> >> >>
> >> >>> On Dec 20, 2019, at 13:02, Andrey Gura <agura@apache.org> wrote:
> >> >>>
> >> >>>> The goal of the proposed metrics is to measure the behavior of cache operations as a whole.
> >> >>>> It provides some kind of statistics (histograms) for it.
> >> >>>
> >> >>> Nikolay, reformulating doesn't make the metrics more meaningful. Seriously :)
> >> >>>
> >> >>>> Yes, metrics will evaluate API call performance
> >> >>>
> >> >>> And so what? Where is the sense in this value? I explained why these metrics
> >> >>> are relatively useless.
> >> >>>
> >> >>>> These are metrics of client-side operation performance.
> >> >>>
> >> >>> Again. It's just a number without any sense.
> >> >>>
> >> >>>> I think a specific user knows what his transactions are.
> >> >>>
> >> >>> Maybe. But the user can't distinguish one transaction from another, so
> >> >>> his knowledge definitely doesn't make sense.
> >> >>>
> >> >>>> With these metrics, he can answer the question «If my transaction includes cacheXXX, how long does it usually take?»
> >> >>>
> >> >>> Actually not. The same caches can be involved in dozens of
> >> >>> transactions, and there is no way to understand which transactions are
> >> >>> slow or fast. It is useless.
> >> >>>
> >> >>>> I disagree here.
> >> >>>> If you have a better approach to measuring cache operation performance, please share your vision.
> >> >>>
> >> >>> I already wrote about a better approach. Two main points:
> >> >>>
> >> >>> 1. Measure some important internals (WAL operations, checkpoint time,
> >> >>> etc.) because they can point to real problems.
> >> >>> 2. Measure business operations in user context, not cache API operations.
> >> >>>
> >> >>> So what do we have? We have useless metrics that are duplicated by useless
> >> >>> histograms.
> >> >>>
> >> >>> We should reconsider the approach to metrics and performance measurement. It
> >> >>> is a hard and long task. There is no need to commit tons of useless
> >> >>> metrics that just decrease performance.
> >> >>>
> >> >>> Sorry for some sarcasm, but I really believe in my opinion. The metrics
> >> >>> problem has existed for a very long time, and the existing metrics have been
> >> >>> discussed many times. No one can explain these metrics to users because that
> >> >>> requires too much additional knowledge about internals. And a metric value
> >> >>> itself depends on many aspects of the internals. This makes it impossible
> >> >>> to interpret. And it's a good time to remove them (in AI 3.0, due to
> >> >>> backward compatibility).
> >> >>>
> >> >>> On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков <nizhikov.dev@gmail.com> wrote:
> >> >>>>
> >> >>>> Hello, Andrey.
> >> >>>>
> >> >>>> The goal of the proposed metrics is to measure the behavior of cache operations as a whole.
> >> >>>> It provides some kind of statistics (histograms) for it.
> >> >>>> For more fine-grained analysis, one will use tracing or other «go deeper» tools.
> >> >>>>
> >> >>>>>> Measured for API calls on the caller node side
> >> >>>>> Values will be the same only for cases when the node is remote relative to the data
> >> >>>>
> >> >>>> Yes, metrics will evaluate API call performance.
> >> >>>> I think this is the most valuable information from a user's point of view.
> >> >>>>
> >> >>>> A regular user wants to know how fast his cache operations perform.
> >> >>>> And these metrics provide the answer.
> >> >>>>
> >> >>>>> For a regular data node (server node), timing will depend on the answers to these questions:
> >> >>>>
> >> >>>> I think these answers are always available.
> >> >>>> I can barely imagine a scenario where one monitors a «black box» cluster and doesn’t know them.
> >> >>>> Even so, all the answers are provided through the system views we brought to Ignite :)
> >> >>>>
> >> >>>>> What is transaction commit or rollback time?
> >> >>>>
> >> >>>> These are metrics of client-side operation performance.
> >> >>>>
> >> >>>> I think a specific user knows what his transactions are.
> >> >>>> With these metrics, he can answer the question «If my transaction includes cacheXXX, how long does it usually take?»
> >> >>>> I think it’s very valuable knowledge.
> >> >>>>
> >> >>>>> It will be implemented for most types of messages.
> >> >>>>
> >> >>>> Good, let’s do it?
> >> >>>>
> >> >>>>> So, from my point of view, commits for get/put/remove and commit/rollback should be reverted.
> >> >>>>
> >> >>>> I disagree here.
> >> >>>> If you have a better approach to measuring cache operation performance, please share your vision.
> >> >>>>
> >> >>>>> On Dec 19, 2019, at 16:03, Andrey Gura <agura@apache.org> wrote:
> >> >>>>>
> >> >>>>> From my point of view, Ignite should provide meaningful metrics for
> >> >>>>> internal components that could be useful for monitoring and analysis.
> >> >>>>> All the suggested options are meaningless in a sense. Below I'll try to
> >> >>>>> explain why.
> >> >>>>>
> >> >>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the caller node side.
> >> >>>>>> Implemented in [1], commit [2].
> >> >>>>>
> >> >>>>> All cache operations in Ignite are distributed. So each value measured
> >> >>>>> for some cache operation will vary depending on where the
> >> >>>>> operation is actually performed. Values will be the same only for cases
> >> >>>>> when the node is remote relative to the data (e.g. a client node).
> >> >>>>>
> >> >>>>> For a regular data node (server node), timing will depend on the answers to these questions:
> >> >>>>>
> >> >>>>> - is the node primary for the particular key or not? (for all operations)
> >> >>>>> - how many backups are configured for the cache? (for put and remove)
> >> >>>>> - what write synchronization mode is configured for the particular cache?
> >> >>>>> (for put and remove)
> >> >>>>> - is readFromBackup enabled for the cache? (for get)
> >> >>>>>
> >> >>>>> Neither Ignite users nor Ignite developers can make any decision based
> >> >>>>> on these metrics.
> >> >>>>>
> >> >>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the caller node side [3].
> >> >>>>>
> >> >>>>> What is transaction commit or rollback time? How is it calculated in
> >> >>>>> Ignite now? What actions are included in a transaction? What actions not
> >> >>>>> related to the cache are executed during transactions?
> >> >>>>>
> >> >>>>> There is no sense in the time of a transaction commit or rollback
> >> >>>>> because there is no way to understand which transaction was
> >> >>>>> performed in a particular period of time. Usually there are a lot of
> >> >>>>> transactions, and we can't distinguish them from each other.
> >> >>>>>
> >> >>>>> Moreover, a transaction is usually treated as a business operation. So the
> >> >>>>> only way to measure performance properly is to measure business operation
> >> >>>>> time. That is, the user should create their own metric set for some business
> >> >>>>> API.
> >> >>>>>
> >> >>>>> Further. What about cross-cache transactions? At the moment, tx
> >> >>>>> commit/rollback time will be added to the corresponding metrics for each
> >> >>>>> cache involved in the transaction. The *same time* for *each cache*.
> >> >>>>> Absolutely meaningless.
> >> >>>>>
> >> >>>>> Again, neither Ignite users nor Ignite developers can make any decision
> >> >>>>> based on these metrics. But users can create their own metric set.
> >> >>>>>
> >> >>>>>> * histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes (primary and backups).
> >> >>>>>> Ticket doesn't exist for it.
> >> >>>>>
> >> >>>>> It will be implemented for most types of messages.
> >> >>>>>
> >> >>>>> Metrics, application monitoring, performance analysis, and measurement
> >> >>>>> are a little harder than they sound. Therefore, we must approach this
> >> >>>>> issue more carefully.
> >> >>>>> Blindly adding new types of metrics will not only fail to improve the
> >> >>>>> situation, but will also worsen the overall performance of the system
> >> >>>>> because metric calculation is always on the hot path.
> >> >>>>>
> >> >>>>> So, from my point of view, commits for get/put/remove and
> >> >>>>> commit/rollback should be reverted.
> >> >>>>>
> >> >>>>> On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev <nsamelchev@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>> I think these metrics are useful.
> >> >>>>>>
> >> >>>>>> I have prepared PR [1] for commit and rollback histograms. [2]
> >> >>>>>> Nikolay, could you take a look, please?
> >> >>>>>>
> >> >>>>>> If you do not mind, I will try to add affinity-node cache metrics:
> >> >>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes (primary and backups).
> >> >>>>>>>> Ticket doesn't exist for it.
> >> >>>>>>
> >> >>>>>> I have filed a ticket for it. [3]
> >> >>>>>>
> >> >>>>>> [1]  https://github.com/apache/ignite/pull/7141
> >> >>>>>> [2]  https://issues.apache.org/jira/browse/IGNITE-12450
> >> >>>>>> [3]  https://issues.apache.org/jira/browse/IGNITE-12453
> >> >>>>>>
> >> >>>>>> On Mon, Dec 16, 2019 at 11:07, Alexei Scherbakov <alexey.scherbakoff@gmail.com> wrote:
> >> >>>>>>>
> >> >>>>>>> I think they are very useful.
> >> >>>>>>>
> >> >>>>>>> On Mon, Dec 16, 2019 at 10:51, Николай Ижиков <nizhikov@apache.org> wrote:
> >> >>>>>>>
> >> >>>>>>>> Hello, Alexei.
> >> >>>>>>>>
> >> >>>>>>>> Thanks for the link to the ticket; I labeled it with the IEP-35 label.
> >> >>>>>>>> What do you think about the proposed metrics set?
> >> >>>>>>>>
> >> >>>>>>>>> On Dec 16, 2019, at 10:29, Alexei Scherbakov <alexey.scherbakoff@gmail.com> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> Nikolay,
> >> >>>>>>>>>
> >> >>>>>>>>> What about batch operations?
> >> >>>>>>>>>
> >> >>>>>>>>> For message processing, a ticket does exist and even has an
> >> >>>>>>>>> implementation from before the new metrics API [1]
> >> >>>>>>>>>
> >> >>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-10418
> >> >>>>>>>>>
> >> >>>>>>>>> On Mon, Dec 16, 2019 at 10:12, Николай Ижиков <nizhikov@apache.org> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hello, Igniters.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I want to give the user an answer to the following question: "How do cache
> >> >>>>>>>>>> API operations perform?"
> >> >>>>>>>>>> It seems we need to implement metrics for basic cache API operations
> >> >>>>>>>>>> like get, put, and remove for that.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I think we should provide the following metrics:
> >> >>>>>>>>>>
> >> >>>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the
> >> >>>>>>>>>> caller node side.
> >> >>>>>>>>>> Implemented in [1], commit [2].
> >> >>>>>>>>>>
> >> >>>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the
> >> >>>>>>>>>> caller node side [3].
> >> >>>>>>>>>>
> >> >>>>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`,
> >> >>>>>>>>>> `commit`, `rollback` messages on affinity nodes (primary and backups).
> >> >>>>>>>>>> Ticket doesn't exist for it.
> >> >>>>>>>>>>
> >> >>>>>>>>>> What do you think?
> >> >>>>>>>>>>
> >> >>>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-12219
> >> >>>>>>>>>> [2]  https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
> >> >>>>>>>>>> [3]  https://issues.apache.org/jira/browse/IGNITE-12450
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>>
> >> >>>>>>>>> Best regards,
> >> >>>>>>>>> Alexei Scherbakov
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>> --
> >> >>>>>>>
> >> >>>>>>> Best regards,
> >> >>>>>>> Alexei Scherbakov
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> Best wishes,
> >> >>>>>> Amelchev Nikita
> >> >>>>
> >> >>
> >>
>
>
>
>
