ignite-dev mailing list archives

From "Николай Ижиков" <nizhi...@apache.org>
Subject Re: Cache operations performance metrics
Date Fri, 20 Dec 2019 12:20:08 GMT
> It will also be visible in other metrics

How will it be visible?

For example, the user sees that the «checkpoint time» metric has become twice as large.
How does that relate to business operations? Have they become slower or faster?
What does it mean for application performance?

On the other hand, if `PutTime` increases, then we know for sure that all operations executing
`put` have become slower.

*Why* it has become slower is the essence of the «go deeper» investigation.
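
To make the `PutTime` example concrete, here is a minimal sketch of a bucketed time histogram. It is an illustration only, not Ignite's implementation: the class name `PutTimeHistogram`, the method names and the bucket bounds are all made up for this example.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical fixed-bucket latency histogram; not Ignite's actual class. */
final class PutTimeHistogram {
    /** Upper bounds of the buckets in nanoseconds, ascending. */
    private final long[] bounds;

    /** counts[i] holds durations <= bounds[i]; the last slot holds everything above. */
    private final AtomicLongArray counts;

    PutTimeHistogram(long... bounds) {
        this.bounds = bounds.clone();
        this.counts = new AtomicLongArray(bounds.length + 1);
    }

    /** Records the duration of one operation. */
    void record(long durationNanos) {
        int i = 0;

        // Find the first bucket whose upper bound is not exceeded.
        while (i < bounds.length && durationNanos > bounds[i])
            i++;

        counts.incrementAndGet(i);
    }

    /** Returns a copy of the per-bucket counters. */
    long[] snapshot() {
        long[] res = new long[counts.length()];

        for (int i = 0; i < res.length; i++)
            res[i] = counts.get(i);

        return res;
    }

    public static void main(String[] args) {
        // Assumed bounds: 1 ms, 10 ms, 100 ms.
        PutTimeHistogram h = new PutTimeHistogram(1_000_000L, 10_000_000L, 100_000_000L);

        h.record(500_000L);     // 0.5 ms
        h.record(5_000_000L);   // 5 ms
        h.record(250_000_000L); // 250 ms

        System.out.println(Arrays.toString(h.snapshot())); // [1, 1, 0, 1]
    }
}
```

With bounds like these, counts shifting from the first bucket toward the later ones is the signal that `put` calls have become slower; finding the cause is a separate step.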

> On Dec 20, 2019, at 15:07, Andrey Gura <agura@apache.org> wrote:
> 
>> If a cache has some percentage of relatively slow transactions, this is a trigger
>> to make a deeper investigation.
> 
> It will also be visible in other metrics. So cache operation metrics
> are still useless because they are transitive values.
> 
>>> 1. Measure some important internals (WAL operations, checkpoint time, etc.) because
>>> they can point to real problems.
> 
>> We already implement it.
> 
> I'm not saying that it isn't implemented. It is just an example of things
> that should be measured. All other metrics depend on internals.
> 
>>> 2. Measure business operations in user context, not cache API operations.
> 
>> Why do you think these approaches should exclude one another?
> 
> Because one of them is useless.
> 
> On Fri, Dec 20, 2019 at 1:43 PM Николай Ижиков <nizhikov@apache.org> wrote:
>> 
>> Hello, Andrey.
>> 
>>> Where is the sense in this value? I explained why these metrics are relatively useless.
>> 
>> I don’t agree with you.
>> I believe they are not useless for a user.
>> And I try to explain why I think so.
>> 
>>> But the user can't distinguish one transaction from another, so his knowledge definitely
>>> doesn't make sense.
>> 
>> Users don't have to distinguish them.
>> If a cache has some percentage of relatively slow transactions, this is a trigger
>> to make a deeper investigation.
>> 
>>> 1. Measure some important internals (WAL operations, checkpoint time, etc.) because
>>> they can point to real problems.
>> 
>> We already implement it.
>> What metrics are missing for internal processes?
>> 
>>> 2. Measure business operations in user context, not cache API operations.
>> 
>> Why do you think these approaches should exclude one another?
>> Users definitely should measure the performance of the whole business transaction.
>>
>> I think we should provide a way to measure the part of the business transaction that
>> relates to Ignite.
>> 
>> 
>>> On Dec 20, 2019, at 13:02, Andrey Gura <agura@apache.org> wrote:
>>> 
>>>> The goal of the proposed metrics is to measure the behavior of cache operations as a whole.
>>>> They provide some kind of statistics (histograms) for it.
>>> 
>>> Nikolay, reformulating doesn't make metrics more meaningful. Seriously :)
>>> 
>>>> Yes, metrics will evaluate API call performance
>>> 
>>> So what? Where is the sense in this value? I explained why these metrics
>>> are relatively useless.
>>> 
>>>> These are metrics of client-side operation performance.
>>> 
>>> Again. It's just a number without any sense.
>>> 
>>>> I think a specific user knows what his transactions are.
>>> 
>>> Maybe. But the user can't distinguish one transaction from another, so
>>> his knowledge definitely doesn't make sense.
>>> 
>>>> From these metrics he can answer the question «If my transaction includes
>>>> cacheXXX, how long does it usually take?»
>>> 
>>> Actually, no. The same caches can be involved in a dozen
>>> transactions, and there is no way to tell which transactions are
>>> slow or fast. It is useless.
>>> 
>>>> I disagree here.
>>>> If you have a better approach to measuring cache operation performance, please
>>>> share your vision.
>>> 
>>> I already wrote about a better approach. Two main points:
>>>
>>> 1. Measure some important internals (WAL operations, checkpoint time,
>>> etc.) because they can point to real problems.
>>> 2. Measure business operations in user context, not cache API operations.
>>> 
>>> So what do we have? We have useless metrics that are duplicated by useless
>>> histograms.
>>>
>>> We should reconsider our approach to metrics and performance measurement. It
>>> is a hard and long task. There is no need to commit tons of useless
>>> metrics that just decrease performance.
>>>
>>> Sorry for the sarcasm, but I really believe in my opinion. The metrics
>>> problem has existed for a very long time, and the existing metrics have been
>>> discussed many times. No one can explain these metrics to users because that requires
>>> too much additional knowledge about internals. And a metric value
>>> itself depends on many aspects of the internals. That makes it impossible
>>> to interpret. And it's a good time to remove them (in AI 3.0, because of
>>> backward compatibility).
>>> 
>>> On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков <nizhikov.dev@gmail.com> wrote:
>>>> 
>>>> Hello, Andrey.
>>>> 
>>>> The goal of the proposed metrics is to measure the behavior of cache operations as a whole.
>>>> They provide some kind of statistics (histograms) for it.
>>>> For more fine-grained analysis, one will use tracing or other «go deeper»
>>>> tools.
>>>> 
>>>>>> Measured for API calls on the caller node side
>>>>> Values will be the same only for cases when the node is remote relative to the data
>>>> 
>>>> Yes, metrics will evaluate API call performance.
>>>> I think this is the most valuable information from a user's point of view.
>>>> 
>>>> A regular user wants to know how fast his cache operations perform.
>>>> And these metrics provide the answer.
>>>> 
>>>>> For a regular data node (server node), timing will depend on the answers to
>>>>> these questions:
>>>> 
>>>> I think these answers are always available.
>>>> I can barely imagine a scenario where one monitors a «black box» cluster and
>>>> doesn't know them.
>>>> Even so, all the answers are provided through the system views we brought to Ignite :)
>>>> 
>>>>> What is transaction commit or rollback time?
>>>> 
>>>> These are metrics of client-side operation performance.
>>>> 
>>>> I think a specific user knows what his transactions are.
>>>> From these metrics he can answer the question «If my transaction includes
>>>> cacheXXX, how long does it usually take?»
>>>> I think it’s very valuable knowledge.
>>>> 
>>>>> It will be implemented for most types of messages.
>>>> 
>>>> Good, let’s do it?
>>>> 
>>>>> So, from my point of view, commits for get/put/remove and commit/rollback
>>>>> should be reverted.
>>>> 
>>>> I disagree here.
>>>> If you have a better approach to measuring cache operation performance, please
>>>> share your vision.
>>>> 
>>>>> On Dec 19, 2019, at 16:03, Andrey Gura <agura@apache.org> wrote:
>>>>> 
>>>>> From my point of view, Ignite should provide meaningful metrics for
>>>>> internal components that could be useful for monitoring and analysis.
>>>>> All suggested options are meaningless in a sense. Below I'll try
>>>>> to explain why.
>>>>> 
>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls
>>>>>> on the caller node side.
>>>>>> Implemented in [1], commit [2].
>>>>> 
>>>>> All cache operations in Ignite are distributed. So each value measured
>>>>> for a cache operation will vary depending on where the operation
>>>>> is actually performed. Values will be the same only for cases when the node
>>>>> is remote relative to the data (e.g. a client node).
>>>>>
>>>>> For a regular data node (server node), timing will depend on the answers to
>>>>> these questions:
>>>>>
>>>>> - is the node primary for a particular key or not? (for all operations)
>>>>> - how many backups are configured for the cache? (for put and remove)
>>>>> - what write synchronization mode is configured for the particular cache?
>>>>> (for put and remove)
>>>>> - is readFromBackup enabled for the cache? (for get)
>>>>>
>>>>> Neither Ignite users nor Ignite developers can make any decision based
>>>>> on these metrics.
>>>>> 
>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on
>>>>>> the caller node side [3].
>>>>> 
>>>>> What is transaction commit or rollback time? How is it calculated in
>>>>> Ignite now? What actions are included in a transaction? What actions not
>>>>> related to the cache are executed during transactions?
>>>>>
>>>>> There is no sense in the time of a transaction commit or rollback
>>>>> because there is no way to understand which transaction was
>>>>> performed in a particular period of time. Usually there are a lot of transactions
>>>>> and we can't distinguish them from each other.
>>>>>
>>>>> Moreover, a transaction is usually treated as a business operation. So the only
>>>>> way to measure performance properly is to measure business operation
>>>>> time. That is, the user should create their own metrics set for some business
>>>>> API.
>>>>>
>>>>> Further, what about cross-cache transactions? At the moment, tx
>>>>> commit/rollback time will be added to the corresponding metrics for each
>>>>> cache involved in the transaction. The *same time* for *each cache*.
>>>>> Absolutely meaningless.
>>>>>
>>>>> Again, neither Ignite users nor Ignite developers can make any decision
>>>>> based on these metrics. But users can create their own metrics set.
>>>>> 
>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`,
>>>>>> `commit`, `rollback` messages on affinity nodes (primary and backups).
>>>>>> Ticket doesn't exist for it.
>>>>> 
>>>>> It will be implemented for most types of messages.
>>>>> 
>>>>> Metrics, application monitoring, performance analysis and measurement
>>>>> are a little harder than they sound. Therefore, we must approach this
>>>>> issue more carefully.
>>>>> Blindly adding new types of metrics will not only fail to improve the
>>>>> situation, but will also worsen the overall performance of the system
>>>>> because metric calculation is always on the hot path.
>>>>> 
>>>>> So, from my point of view, commits for get/put/remove and
>>>>> commit/rollback should be reverted.
>>>>> 
>>>>> On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev <nsamelchev@gmail.com> wrote:
>>>>>> 
>>>>>> I think these metrics are useful.
>>>>>> 
>>>>>> I have prepared PR [1] for commit and rollback histograms. [2]
>>>>>> Nikolay, could you take a look, please?
>>>>>> 
>>>>>> If you do not mind, I will try to add affinity-nodes cache metrics:
>>>>>>>> * histograms that measure the time of processing `get`, `put`,
>>>>>>>> `remove`, `commit`, `rollback` messages on affinity nodes (primary and backups). Ticket doesn't
>>>>>>>> exist for it.
>>>>>> 
>>>>>> I have filed a ticket for it. [3]
>>>>>> 
>>>>>> [1] https://github.com/apache/ignite/pull/7141
>>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-12450
>>>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12453
>>>>>> 
>>>>>> On Mon, Dec 16, 2019 at 11:07, Alexei Scherbakov <alexey.scherbakoff@gmail.com> wrote:
>>>>>>> 
>>>>>>> I think they are very useful.
>>>>>>> 
>>>>>>> On Mon, Dec 16, 2019 at 10:51, Николай Ижиков <nizhikov@apache.org> wrote:
>>>>>>> 
>>>>>>>> Hello, Alexei.
>>>>>>>> 
>>>>>>>> Thanks for the link to the ticket; I labeled it with the IEP-35 label.
>>>>>>>> What do you think about the proposed metrics set?
>>>>>>>> 
>>>>>>>>> On Dec 16, 2019, at 10:29, Alexei Scherbakov <alexey.scherbakoff@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Nikolay,
>>>>>>>>> 
>>>>>>>>> What about batch operations?
>>>>>>>>> 
>>>>>>>>> For message processing, a ticket does exist and even has an
>>>>>>>>> implementation from before the new metrics API [1]
>>>>>>>>> 
>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-10418
>>>>>>>>> 
>>>>>>>>> On Mon, Dec 16, 2019 at 10:12, Николай Ижиков <nizhikov@apache.org> wrote:
>>>>>>>>> 
>>>>>>>>>> Hello, Igniters.
>>>>>>>>>> 
>>>>>>>>>> I want to give the user an answer to the following question: "How do cache
>>>>>>>>>> API operations perform?"
>>>>>>>>>> It seems we need to implement metrics for basic cache API operations
>>>>>>>>>> like get, put, remove for it.
>>>>>>>>>> 
>>>>>>>>>> I think we should provide the following metrics:
>>>>>>>>>> 
>>>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the
>>>>>>>>>> caller node side.
>>>>>>>>>> Implemented in [1], commit [2].
>>>>>>>>>>
>>>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the
>>>>>>>>>> caller node side [3].
>>>>>>>>>>
>>>>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`,
>>>>>>>>>> `commit`, `rollback` messages on affinity nodes (primary and backups).
>>>>>>>>>> Ticket doesn't exist for it.
>>>>>>>>>> 
>>>>>>>>>> What do you think?
>>>>>>>>>> 
>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-12219
>>>>>>>>>> [2] https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
>>>>>>>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12450
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> Alexei Scherbakov
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> Alexei Scherbakov
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Best wishes,
>>>>>> Amelchev Nikita
>>>> 
>> 
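
The other approach debated in this thread, timing a whole business operation in user code rather than individual cache API calls, could be sketched roughly as below. This is only an illustration under stated assumptions: `BusinessOpTimer`, its methods and the operation name `transferMoney` are hypothetical and are not part of any Ignite API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical user-side timer keyed by business operation name. */
final class BusinessOpTimer {
    /** Total elapsed time per operation name, in nanoseconds. */
    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();

    /** Number of completed calls per operation name. */
    private final Map<String, LongAdder> calls = new ConcurrentHashMap<>();

    /** Runs the operation and records its duration, even if it throws. */
    <T> T time(String opName, Supplier<T> op) {
        long start = System.nanoTime();

        try {
            return op.get();
        }
        finally {
            totalNanos.computeIfAbsent(opName, k -> new LongAdder()).add(System.nanoTime() - start);
            calls.computeIfAbsent(opName, k -> new LongAdder()).increment();
        }
    }

    long calls(String opName) {
        LongAdder a = calls.get(opName);

        return a == null ? 0 : a.sum();
    }

    long totalNanos(String opName) {
        LongAdder a = totalNanos.get(opName);

        return a == null ? 0 : a.sum();
    }

    public static void main(String[] args) {
        BusinessOpTimer timer = new BusinessOpTimer();

        // "transferMoney" stands in for user code that may touch several caches
        // inside one transaction; here it just returns a value.
        String res = timer.time("transferMoney", () -> "ok");

        System.out.println(res + ", calls=" + timer.calls("transferMoney"));
    }
}
```

A timer like this attributes time to the business operation as a whole, so a cross-cache transaction is counted once, under the operation's own name. It does not say which cache call inside the operation was slow, which is the part per-operation cache metrics are meant to cover.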

