ignite-dev mailing list archives

From Ilya Kasnacheev <ilya.kasnach...@gmail.com>
Subject Re: Re[2]: Cache operations performance metrics
Date Mon, 23 Dec 2019 14:45:28 GMT
Hello!

Let me chime in to this discussion.

If we are adding any new metrics, please make sure that they are accessible.

I would expect metrics to be printed to the console from time to time, at
least when they deviate from the norm. It would also help if they were
available as a Web Console screen, a system SQL view, or a control.sh
command, in that order.
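
A minimal sketch of the "print when it deviates from the norm" idea, in
Python for brevity. The `DeviationReporter` class, its parameters, and the
metric name are hypothetical and not part of Ignite; the median-based
baseline and the 2x threshold are assumptions chosen only for illustration.

```python
import logging
import statistics
from collections import deque

log = logging.getLogger("metrics")


class DeviationReporter:
    """Logs a metric only when it drifts away from its recent baseline.

    Hypothetical helper for illustration; not an Ignite API.
    """

    def __init__(self, name, window=60, factor=2.0, min_samples=5):
        self.name = name
        self.factor = factor            # how far from the baseline counts as a deviation
        self.min_samples = min_samples  # stay silent until a baseline exists
        self.history = deque(maxlen=window)

    def observe(self, value):
        """Record one sample; return True if it was reported as a deviation."""
        reported = False
        if len(self.history) >= self.min_samples:
            baseline = statistics.median(self.history)
            too_high = value > baseline * self.factor
            too_low = value < baseline / self.factor
            if baseline > 0 and (too_high or too_low):
                log.warning("Metric %s deviates from norm: value=%s, baseline=%s",
                            self.name, value, baseline)
                reported = True
        self.history.append(value)
        return reported
```

With something like this wired to a periodic sampler, a user's log would
stay quiet in steady state and gain one line when, say, put time doubles.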

Ideally, discussion of any new metric would start from the Web Console
screen that shows it. That is much easier to sell to the community.

Let me explain my skin in the game: as you know, I answer a large number
of user questions. I often ask users for logs, so I potentially benefit
from anything that is printed to the logs, and I get zero benefit from
anything that needs extensive pre-configuration, since users often
abandon their efforts before they set up any comprehensive monitoring
framework. So it would really help me if we removed useless messages from
the logs and added insightful ones.

This also goes for existing monitoring! If you think that we have enough
metrics available, please make them more accessible to remove the need for
discussion.

Regards,
-- 
Ilya Kasnacheev


Fri, Dec 20, 2019 at 16:59, Zhenya Stanilovsky <arzamas123@mail.ru.invalid>:

>
> >> Did it become slower or faster?
> >
> >Exactly the right question! The user sees that the "cache put time" metric
> >has become 2x bigger. Did puts become slower or faster? Or did we just put
> >values into the cache that are 4x bigger in size? Or were all values put
> >locally before, and now they are put on remote nodes? Or do our operations
> >execute in a transaction, so that the time depends on the transaction type,
> >the actions in the transaction, and other transactions, and actually tells
> >us nothing about the real cache operation? We have more questions than
> >answers.
>
> Andrey, I hope I understand your point of view here, but between having
> something and having nothing, I choose something; it is sometimes really
> helpful. A real-life case: I found one grid machine with very different IO
> usage from the others. Digging deeper highlighted a cache whose put
> operations were very different from those on other nodes, and a final
> «dig deeper» helped to find a code bug. To be clear, the old mechanism
> works OK for me here, but if the new one would be more useful, why not?
>
> >> On the other hand - if `PutTime` increased - then we know for sure, all operations executing `put` become slower.
> >
> >Of course not :) See above.
> >
> >On Fri, Dec 20, 2019 at 3:20 PM Николай Ижиков <nizhikov@apache.org> wrote:
> >>
> >> > It also will be visible on other metrics
> >>
> >> How will it be visible?
> >>
> >> For example, the user sees that the «checkpoint time» metric has become 2x bigger.
> >> How does it relate to business operations? Did they become slower or faster?
> >> What does it mean for application performance?
> >>
> >> On the other hand - if `PutTime` increased - then we know for sure, all operations executing `put` become slower.
> >>
> >> *Why* they became slower is the essence of the «go deeper» investigation.
> >>
> >> > On Dec 20, 2019, at 15:07, Andrey Gura <agura@apache.org> wrote:
> >> >
> >> >> If a cache has some percent of relatively slow transactions, this is a trigger to make a deeper investigation.
> >> >
> >> > It will also be visible in other metrics. So cache operation metrics
> >> > are still useless because they are transitive values.
> >> >
> >> >>> 1. Measure some important internals (WAL operations, checkpoint time, etc) because they can point to real problems.
> >> >
> >> >> We already implement it.
> >> >
> >> > I am not saying that it isn't implemented. It is just an example of things
> >> > that should be measured. All other metrics depend on internals.
> >> >
> >> >>> 2. Measure business operations in user context, not cache API operations.
> >> >
> >> >> Why do you think these approaches should exclude one another?
> >> >
> >> > Because one of them is useless.
> >> >
> >> > On Fri, Dec 20, 2019 at 1:43 PM Николай Ижиков <nizhikov@apache.org> wrote:
> >> >>
> >> >> Hello, Andrey.
> >> >>
> >> >>> Where is the sense in this value? I explained why these metrics are relatively useless.
> >> >>
> >> >> I don’t agree with you.
> >> >> I believe they are not useless for a user.
> >> >> And I will try to explain why I think so.
> >> >>
> >> >>> But the user can't distinguish one transaction from another, so his knowledge definitely doesn't make sense.
> >> >>
> >> >> Users shouldn’t have to distinguish.
> >> >> If a cache has some percent of relatively slow transactions, this is a trigger to make a deeper investigation.
> >> >>
> >> >>> 1. Measure some important internals (WAL operations, checkpoint time, etc) because they can point to real problems.
> >> >>
> >> >> We already implement it.
> >> >> What metrics are missing for internal processes?
> >> >>
> >> >>> 2. Measure business operations in user context, not cache API operations.
> >> >>
> >> >> Why do you think these approaches should exclude one another?
> >> >> Users definitely should measure whole business transaction performance.
> >> >>
> >> >> I think we should provide a way to measure the part of the business transaction that relates to Ignite.
> >> >>
> >> >>
> >> >>> On Dec 20, 2019, at 13:02, Andrey Gura <agura@apache.org> wrote:
> >> >>>
> >> >>>> The goal of the proposed metrics is to measure whole cache operations behavior.
> >> >>>> It provides some kind of statistics(histograms) for it.
> >> >>>
> >> >>> Nikolay, reformulating doesn't make metrics more meaningful. Seriously :)
> >> >>>
> >> >>>> Yes, metrics will evaluate API call performance
> >> >>>
> >> >>> And what? Where is the sense in this value? I explained why these metrics
> >> >>> are relatively useless.
> >> >>>
> >> >>>> These are metrics of client-side operation performance.
> >> >>>
> >> >>> Again. It's just a number without any sense.
> >> >>>
> >> >>>> I think a specific user has knowledge - what are his transactions.
> >> >>>
> >> >>> Maybe. But the user can't distinguish one transaction from another, so
> >> >>> his knowledge definitely doesn't make sense.
> >> >>>
> >> >>>> From these metrics it can answer on the question «If my transaction includes cacheXXX, how long it usually takes?»
> >> >>>
> >> >>> Actually not. The same caches can be involved in a dozen
> >> >>> transactions and there is no way to understand which transactions are
> >> >>> slow or fast. It is useless.
> >> >>>
> >> >>>> I disagree here.
> >> >>>> If you have a better approach to measure cache operations performance - please, share your vision.
> >> >>>
> >> >>> I already wrote about a better approach. Two main points:
> >> >>>
> >> >>> 1. Measure some important internals (WAL operations, checkpoint time,
> >> >>> etc) because they can point to real problems.
> >> >>> 2. Measure business operations in user context, not cache API operations.
> >> >>>
> >> >>> So what do we have? We have useless metrics that are doubled by useless
> >> >>> histograms.
> >> >>>
> >> >>> We should reconsider the approach to metrics and performance measuring.
> >> >>> It is a hard and long task. There is no need to commit tons of useless
> >> >>> metrics that just decrease performance.
> >> >>>
> >> >>> Sorry for some sarcasm, but I really believe in my opinion. The metrics
> >> >>> problem has existed for a very long time, and the existing metrics have
> >> >>> been discussed many times. No one can explain these metrics to users
> >> >>> because that requires too much additional knowledge about internals.
> >> >>> And the metric value itself depends on many aspects of internals. That
> >> >>> makes it impossible to interpret. And it's a good time to remove them
> >> >>> (in AI 3.0, due to backward compatibility).
> >> >>>
> >> >>> On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков <nizhikov.dev@gmail.com> wrote:
> >> >>>>
> >> >>>> Hello, Andrey.
> >> >>>>
> >> >>>> The goal of the proposed metrics is to measure whole cache operations behavior.
> >> >>>> It provides some kind of statistics(histograms) for it.
> >> >>>> For more fine-grained analysis one will use tracing or other «go deeper» tools.
> >> >>>>
> >> >>>>>> Measured for API calls on the caller node side
> >> >>>>> Values will be the same only for cases when the node is remote relative to data
> >> >>>>
> >> >>>> Yes, metrics will evaluate API call performance.
> >> >>>> I think this is the most valuable information from a user's point of view.
> >> >>>>
> >> >>>> A regular user wants to know how fast his cache operation performs.
> >> >>>> And these metrics provide the answer.
> >> >>>>
> >> >>>>> For a regular data node (server node), timing will depend on answers to these questions:
> >> >>>>
> >> >>>> I think these answers are always available.
> >> >>>> I can barely imagine a scenario where one monitors a «black box» cluster and doesn’t know them.
> >> >>>> Even so, all answers are provided through the system views we brought to Ignite :)
> >> >>>>
> >> >>>>> What is transaction commit or rollback time?
> >> >>>>
> >> >>>> These are metrics of client-side operation performance.
> >> >>>>
> >> >>>> I think a specific user has knowledge - what are his transactions.
> >> >>>> From these metrics it can answer on the question «If my transaction includes cacheXXX, how long it usually takes?»
> >> >>>> I think it’s very valuable knowledge.
> >> >>>>
> >> >>>>> It will be implemented for most types of messages.
> >> >>>>
> >> >>>> Good, let’s do it?
> >> >>>>
> >> >>>>> So, from my point of view, commits for get/put/remove and commit/rollback should be reverted.
> >> >>>>
> >> >>>> I disagree here.
> >> >>>> If you have a better approach to measure cache operations performance - please, share your vision.
> >> >>>>
> >> >>>>> On Dec 19, 2019, at 16:03, Andrey Gura <agura@apache.org> wrote:
> >> >>>>>
> >> >>>>> From my point of view, Ignite should provide meaningful metrics for
> >> >>>>> internal components that could be useful for monitoring and analysis.
> >> >>>>> All suggested options are meaningless in a sense. Below I'll try to
> >> >>>>> explain why.
> >> >>>>>
> >> >>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the caller node side.
> >> >>>>>> Implemented in [1], commit [2].
> >> >>>>>
> >> >>>>> All cache operations in Ignite are distributed. So each value measured
> >> >>>>> for some cache operation will vary depending on where the operation
> >> >>>>> is actually performed. Values will be the same only for cases when the
> >> >>>>> node is remote relative to data (e.g. client node).
> >> >>>>>
> >> >>>>> For a regular data node (server node), timing will depend on answers to these questions:
> >> >>>>>
> >> >>>>> - is the node primary for the particular key or not? (for all operations)
> >> >>>>> - how many backups are configured for the cache? (for put and remove)
> >> >>>>> - what write synchronization mode is configured for the particular cache?
> >> >>>>> (for put and remove)
> >> >>>>> - is readFromBackup enabled for the cache? (for get)
> >> >>>>>
> >> >>>>> Neither Ignite users nor Ignite developers can make any decision based
> >> >>>>> on these metrics.
> >> >>>>>
> >> >>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the caller node side [3].
> >> >>>>>
> >> >>>>> What is transaction commit or rollback time? How is it calculated in
> >> >>>>> Ignite now? What actions are included in the transaction? What actions
> >> >>>>> not related to the cache are executed during transactions?
> >> >>>>>
> >> >>>>> There is no sense in the time of a transaction commit or rollback
> >> >>>>> because there is no way to understand which transaction was
> >> >>>>> performed in a particular period of time. Usually there are a lot of
> >> >>>>> transactions and we can't distinguish them from each other.
> >> >>>>>
> >> >>>>> Moreover, a transaction is usually treated as a business operation. So
> >> >>>>> the only way to measure performance properly is to measure business
> >> >>>>> operation time. That is, the user should create their own metrics set
> >> >>>>> for some business API.
> >> >>>>>
> >> >>>>> Further. What about cross-cache transactions? At the moment tx
> >> >>>>> commit/rollback time will be added to the corresponding metrics for each
> >> >>>>> cache involved in the transaction. The *same time* for *each cache*.
> >> >>>>> Absolutely meaningless.
> >> >>>>>
> >> >>>>> Again, neither Ignite users nor Ignite developers can make any decision
> >> >>>>> based on these metrics. But users can create their own metrics set.
> >> >>>>>
> >> >>>>>> * histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes(primary and backups).
> >> >>>>>> Ticket doesn't exist for it.
> >> >>>>>
> >> >>>>> It will be implemented for most types of messages.
> >> >>>>>
> >> >>>>> Metrics, application monitoring, performance analysis and measurement
> >> >>>>> are a little harder than they sound. Therefore, we must approach this
> >> >>>>> issue more carefully.
> >> >>>>> Blindly adding new types of metrics will not only fail to improve the
> >> >>>>> situation, but will also worsen the overall performance of the system
> >> >>>>> because metric calculation is always on the hot path.
> >> >>>>>
> >> >>>>> So, from my point of view, commits for get/put/remove and
> >> >>>>> commit/rollback should be reverted.
> >> >>>>>
> >> >>>>> On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev <nsamelchev@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>> I think these metrics are useful.
> >> >>>>>>
> >> >>>>>> I have prepared PR [1] for commit and rollback histograms. [2]
> >> >>>>>> Nikolay, could you take a look, please?
> >> >>>>>>
> >> >>>>>> If you do not mind, I will try to add affinity-nodes cache metrics:
> >> >>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes(primary and backups). Ticket doesn't exist for it.
> >> >>>>>>
> >> >>>>>> I have filed a ticket for it. [3]
> >> >>>>>>
> >> >>>>>> [1]  https://github.com/apache/ignite/pull/7141
> >> >>>>>> [2]  https://issues.apache.org/jira/browse/IGNITE-12450
> >> >>>>>> [3]  https://issues.apache.org/jira/browse/IGNITE-12453
> >> >>>>>>
> >> >>>>>> Mon, Dec 16, 2019 at 11:07, Alexei Scherbakov <alexey.scherbakoff@gmail.com>:
> >> >>>>>>>
> >> >>>>>>> I think they are very useful.
> >> >>>>>>>
> >> >>>>>>> Mon, Dec 16, 2019 at 10:51, Николай Ижиков <nizhikov@apache.org>:
> >> >>>>>>>
> >> >>>>>>>> Hello, Alexei.
> >> >>>>>>>>
> >> >>>>>>>> Thanks for the link to the ticket; I labeled it with the IEP-35 label.
> >> >>>>>>>> What do you think about the proposed metrics set?
> >> >>>>>>>>
> >> >>>>>>>>> On Dec 16, 2019, at 10:29, Alexei Scherbakov <alexey.scherbakoff@gmail.com> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> Nikolay,
> >> >>>>>>>>>
> >> >>>>>>>>> What about batch operations?
> >> >>>>>>>>>
> >> >>>>>>>>> For message processing, the ticket does exist and even has an
> >> >>>>>>>>> implementation from before the new metrics API times [1]
> >> >>>>>>>>>
> >> >>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-10418
> >> >>>>>>>>>
> >> >>>>>>>>> Mon, Dec 16, 2019 at 10:12, Николай Ижиков <nizhikov@apache.org>:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hello, Igniters.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I want to provide the user answers to the following question: "How do cache
> >> >>>>>>>>>> API operations perform?"
> >> >>>>>>>>>> It seems we need to implement metrics for basic cache API operations
> >> >>>>>>>>>> like get, put, remove for it.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I think we should provide the following metrics:
> >> >>>>>>>>>>
> >> >>>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the
> >> >>>>>>>>>> caller node side.
> >> >>>>>>>>>> Implemented in [1], commit [2].
> >> >>>>>>>>>>
> >> >>>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the
> >> >>>>>>>>>> caller node side [3].
> >> >>>>>>>>>>
> >> >>>>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`,
> >> >>>>>>>>>> `commit`, `rollback` messages on affinity nodes(primary and backups).
> >> >>>>>>>>>> Ticket doesn't exist for it.
> >> >>>>>>>>>>
> >> >>>>>>>>>> What do you think?
> >> >>>>>>>>>>
> >> >>>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-12219
> >> >>>>>>>>>> [2]  https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
> >> >>>>>>>>>> [3]  https://issues.apache.org/jira/browse/IGNITE-12450
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>>
> >> >>>>>>>>> Best regards,
> >> >>>>>>>>> Alexei Scherbakov
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>> --
> >> >>>>>>>
> >> >>>>>>> Best regards,
> >> >>>>>>> Alexei Scherbakov
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> Best wishes,
> >> >>>>>> Amelchev Nikita
> >> >>>>
> >> >>
> >>
>
>
>
>
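
For reference, the idea debated in the quoted thread (time histograms, and
the suggestion that users time whole business operations in their own code)
can be sketched as below. This is a Python illustration under stated
assumptions, not Ignite's implementation: `TimeHistogram`, its default
bucket bounds, and the `timed()` helper are hypothetical names.

```python
import bisect
import time
from contextlib import contextmanager


class TimeHistogram:
    """Counts durations into buckets delimited by upper bounds in milliseconds.

    Hypothetical helper for illustration; not an Ignite API.
    """

    def __init__(self, bounds_ms=(1, 10, 100, 1000)):
        self.bounds_ms = list(bounds_ms)
        # One bucket per bound plus an overflow bucket for slower operations.
        self.counts = [0] * (len(self.bounds_ms) + 1)

    def record(self, duration_ms):
        # bisect_left places a duration equal to a bound into that bound's bucket.
        self.counts[bisect.bisect_left(self.bounds_ms, duration_ms)] += 1

    @contextmanager
    def timed(self):
        """Time the enclosed block and record it, even if the block raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record((time.perf_counter() - start) * 1000.0)
```

Wrapping a whole business operation in `with histogram.timed():` measures
it in user context; wrapping a single cache call measures only the API call
on the caller side, which is the distinction the thread argues about.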
