ignite-dev mailing list archives

From Vladimir Ozerov <voze...@gridgain.com>
Subject Re: Lightweight profiling of messages processing
Date Tue, 27 Nov 2018 15:02:46 GMT
Alexey,

I would say that to implement this feature successfully, we first need to
clearly understand specific scenarios we want to target, and only then plan
implementation in small iterations. Please take in count the following
points:
1) Splitting by message type might be too fine-grained for the first
iteration. For example, suppose we have N different message types processed
in a system pool. Add several messages causing long-running tasks (e.g.
invoke), and dequeue time will grow for all other message types in the same
queue. I think we can start with simpler metrics: per-pool queueing time,
pool throughput, and pool task latencies. This would be enough for a wide
range of use cases. We may expand it in the future if needed.
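The per-pool metrics mentioned above could be collected with a simple task wrapper around submissions to each pool. The sketch below is purely illustrative and uses no Ignite internals; all names (PoolStats, wrap, avgWaitMs) are made up:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch (not Ignite API): per-pool queueing time and task
// latency, measured by wrapping tasks before they enter the executor queue.
public class PoolStats {
    final AtomicLong completed = new AtomicLong();
    final AtomicLong totalWaitNanos = new AtomicLong();
    final AtomicLong totalExecNanos = new AtomicLong();

    // Wrap at submission time; the delta between wrap() and run() is queue wait.
    Runnable wrap(Runnable task) {
        final long enqueuedAt = System.nanoTime();
        return () -> {
            long startedAt = System.nanoTime();
            try {
                task.run();
            } finally {
                totalWaitNanos.addAndGet(startedAt - enqueuedAt);
                totalExecNanos.addAndGet(System.nanoTime() - startedAt);
                completed.incrementAndGet();
            }
        };
    }

    double avgWaitMs() {
        long n = completed.get();
        return n == 0 ? 0.0 : totalWaitNanos.get() / 1_000_000.0 / n;
    }

    double avgExecMs() {
        long n = completed.get();
        return n == 0 ? 0.0 : totalExecNanos.get() / 1_000_000.0 / n;
    }
}
```

Such a wrapper keeps the hot path to two System.nanoTime() calls and three atomic adds per task, which matches the "lightweight" goal of this thread.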
2) Some IO statistics were already implemented as a part of IGNITE-6868
[1]. See TcpCommunicationSpiMBeanImpl. You may find it useful for your task.
3) Public API should be designed upfront. Normally it should include JMX
and system SQL views (e.g. [2]). JMX is useful for external tools, while SQL
views allow access to performance data from all platforms at once. This is
critical.
4) It should be possible to enable/disable such metrics at runtime.
Probably the infrastructure from IGNITE-369 [3] could be reused for this
purpose (see GridCacheProcessor.EnableStatisticsFuture).
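For illustration only, a runtime toggle exposed over JMX might look like the sketch below. The interface and class names are hypothetical and this is not the IGNITE-369 infrastructure; the point is that the hot path checks a volatile flag and skips all accounting while disabled:

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical MXBean: lets external tools flip statistics collection at runtime.
public interface PoolStatsMXBean {
    boolean isStatisticsEnabled();
    void setStatisticsEnabled(boolean enabled);
    long getCompletedTasks();
}

class PoolStatsMBean implements PoolStatsMXBean {
    // volatile so executor threads see the toggle without locking
    private volatile boolean enabled;
    private final AtomicLong completed = new AtomicLong();

    @Override public boolean isStatisticsEnabled() { return enabled; }
    @Override public void setStatisticsEnabled(boolean e) { enabled = e; }
    @Override public long getCompletedTasks() { return completed.get(); }

    // Hot path: accounting is skipped entirely while disabled.
    public void onTaskCompleted() {
        if (enabled)
            completed.incrementAndGet();
    }

    // Registration so external tools (JConsole, etc.) can reach the bean.
    public static void register(PoolStatsMBean bean) throws Exception {
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();
        srv.registerMBean(bean, new ObjectName("org.example:type=PoolStats"));
    }
}
```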
5) IO access statistics are already under development [4]. The first part
will be merged very soon.
6) Performance effects must be measured extremely carefully in all modes
(in-memory, persistence, background/log_only/fsync), because you are likely
to change very performance-sensitive code. We had a lot of performance
issues when implementing IGNITE-6868 [1].

Vladimir.

[1] https://issues.apache.org/jira/browse/IGNITE-6868
[2] https://issues.apache.org/jira/browse/IGNITE-7700
[3] https://issues.apache.org/jira/browse/IGNITE-369
[4]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-27%3A+Page+IO+statistics

On Tue, Nov 27, 2018 at 5:04 PM Alexey Kukushkin <kukushkinalexey@gmail.com>
wrote:

> Hi Alexei,
>
> Did you consider general-purpose APM
> <https://en.wikipedia.org/wiki/Application_performance_management> tools
> like free InspectIT <https://github.com/inspectIT/inspectIT> or commercial
> DynaTrace <https://www.dynatrace.com/> or AppDynamics
> <https://www.appdynamics.com/>?
>
> Java APM tools do not require writing code or even instrumenting Ignite
> binaries: you just attach them as a javaagent
> <https://zeroturnaround.com/rebellabs/how-to-inspect-classes-in-your-jvm/>
> to the Ignite JVM and then configure "sensors" to track whatever API you
> want. APM tools collect metrics from the sensors and provide sophisticated
> analysis tools.
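For reference, the attachment mechanism these tools rely on is the standard java.lang.instrument API: the JVM calls premain() before the application's main() when started with `java -javaagent:agent.jar ...`. A minimal skeleton (the class name is made up) looks like this:

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Minimal javaagent skeleton: the registered transformer may rewrite the
// bytecode of loaded classes, which is how APM "sensors" are injected.
public class ProfilingAgent implements ClassFileTransformer {
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ProfilingAgent());
    }

    @Override
    public byte[] transform(ClassLoader loader, String className, Class<?> redefined,
                            ProtectionDomain domain, byte[] classfileBuffer) {
        // A real APM agent would instrument selected classes here;
        // returning null leaves the class bytes unchanged.
        return null;
    }
}
```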
>
> DynaTrace claims a 2%-7% overhead (depending on the application), but you
> can always detach the tool if you do not always need it in production.
>
> I did not try APM for Ignite myself but it might work well.
>
> On Tue, Nov 27, 2018 at 4:37 PM Alexei Scherbakov <
> alexey.scherbakoff@gmail.com> wrote:
>
> > Igniters,
> >
> > At work I often have to solve performance issues with Ignite cluster
> > without having access to source code of running user application.
> >
> > It looks like Ignite has limited capabilities for identifying
> > bottlenecks without extensive profiling on the server and client side
> > (JFR recordings, sampling profilers, regular thread dumps, etc.), which
> > is not always possible.
> >
> > Even profiling data is not always helpful for determining several types
> > of bottlenecks, for example, whether there is contention on a single
> > key/partition.
> >
> > I propose to implement a new feature: lightweight profiling of message
> > processing.
> >
> > The feature will provide a view of message processing statistics for
> > each node and for all grid nodes.
> >
> > In short, it's necessary to track each message execution in executors
> > and record execution statistics such as synchronous execution time in
> > the executor thread and waiting time in the queue.
> >
> > Full description:
> >
> > 1. Implement detailed tracking of message waiting in queue and actual
> > processing by executors, split into several time bins. Example of
> > detailed statistics for each processed message:
> >
> >
> > Processing time(%)
> >
> > Message  Total  Average(ms)  <1ms  <10ms  <30ms  <50ms  <100ms  <250ms  <500ms  <750ms  <1000ms  >1000ms
> > ---------------------------------------------------------------------------------------------------------
> > GridNearSingleGetRequest  904311372  0.023000  904240521  57394  7242  3961  1932  229  61  24  4  4
> > GridNearSingleGetResponse  340134416  0.041000  340118791  11660  1167  729  901  1001  158  8  1  0
> > GridNearLockRequest  77088689  0.079000  77073458  11945  2299  643  311  31  2  0  0  0
> > GridNearAtomicSingleUpdateInvokeRequest  39645752  0.298000  39580914  28222  6469  4638  9870  13414  2087  137  1  0
> > GridDhtAtomicSingleUpdateRequest  37636829  0.277000  37579375  23247  5915  4210  8954  12917  2048  162  1  0
> > GridDhtAtomicDeferredUpdateResponse  33580198  0.002000  33579805  337  51  3  1  1  0  0  0  0
> > GridNearTxPrepareRequest  21667790  0.238000  21078069  580126  1622  1261  2531  3631  496  40  0  14
> > GridDhtTxPrepareResponse  20949873  0.316000  17130161  3803105  4615  3162  4489  3721  577  34  1  8
> > GridDhtTxPrepareRequest  20949838  0.501000  16158732  4750217  16183  5735  8472  8994  1353  88  11  53
> > GridDhtTxFinishResponse  13835065  0.007000  13832519  2476  27  28  14  1  0  0  0  0
> > GridDhtTxFinishRequest  13835028  0.547000  12084106  1736789  8971  2340  1792  807  118  41  4  60
> > GridNearTxFinishRequest  13762197  0.725000  11811828  1942499  4441  1400  1201  524  89  34  19  162
> > GridDhtAtomicNearResponse  2784422  0.122000  2783393  1022  5  2  0  0  0  0  0  0
> > GridNearGetRequest  2360483  0.484000  2345937  14129  244  101  64  8  0  0  0  0
> > GridNearGetResponse  1984243  0.054000  1981905  2327  8  1  1  1  0  0  0  0
> > GridNearTxPrepareResponse  192856  0.153000  192660  188  1  5  1  1  0  0  0  0
> > GridNearLockResponse  192780  0.091000  192667  107  3  0  3  0  0  0  0  0
> > GridNearTxFinishResponse  177  0.822000  129  47  1  0  0  0  0  0  0  0
> > GridNearAtomicSingleUpdateRequest  124  4.803000  52  53  19  0  0  0  0  0  0  0
> > GridNearAtomicUpdateResponse  120  0.448000  110  10  0  0  0  0  0  0  0  0
> > ---------------------------------------------------------------------------------------------------------
> > Total  1544912252  -  1531765132  12965900  59283  28219  40537  45281  6989  568  42  301
> >
> >
> > Queue waiting time(%)
> >
> > Message  Total  Average(ms)  <1ms  <10ms  <30ms  <50ms  <100ms  <250ms  <500ms  <750ms  <1000ms  >1000ms
> > ---------------------------------------------------------------------------------------------------------
> > GridNearSingleGetRequest  904311372  0.023000  903130400  1051499  33248  18049  34044  38809  4714  212  8  389
> > GridNearSingleGetResponse  340134416  0.041000  339708055  419479  3732  1004  1222  737  117  2  1  67
> > GridNearLockRequest  77088689  0.079000  76949792  129666  2003  1624  2585  2531  390  18  2  78
> > GridNearAtomicSingleUpdateInvokeRequest  39645752  0.298000  39572801  53913  3949  2960  5487  5583  963  45  5  46
> > GridDhtAtomicSingleUpdateRequest  37636829  0.277000  37561043  57294  3721  2736  5339  5716  879  54  6  41
> > GridDhtAtomicDeferredUpdateResponse  33580198  0.002000  33512734  45873  4427  3340  6342  6363  1033  50  10  26
> > GridNearTxPrepareRequest  21667790  0.238000  21584416  76283  1681  1026  1948  2027  330  31  2  46
> > GridDhtTxPrepareResponse  20949873  0.316000  20862632  83578  1214  633  821  817  118  26  8  26
> > GridDhtTxPrepareRequest  20949838  0.501000  20867551  72264  2265  1516  2773  2905  479  38  6  41
> > GridDhtTxFinishResponse  13835065  0.007000  13808274  26487  150  68  55  21  3  1  0  6
> > GridDhtTxFinishRequest  13835028  0.547000  13809380  25142  188  127  102  59  9  2  4  15
> > GridNearTxFinishRequest  13762197  0.725000  13743925  17365  241  171  278  191  16  4  0  6
> > GridDhtAtomicNearResponse  2784422  0.122000  2780600  3767  29  9  7  9  1  0  0  0
> > GridNearGetRequest  2360483  0.484000  2355963  3762  194  114  184  228  38  0  0  0
> > GridNearGetResponse  1984243  0.054000  1979404  4721  31  9  49  28  1  0  0  0
> > GridNearTxPrepareResponse  192856  0.153000  192852  2  0  1  1  0  0  0  0  0
> > GridNearLockResponse  192780  0.091000  192777  3  0  0  0  0  0  0  0  0
> > GridNearTxFinishResponse  177  0.822000  177  0  0  0  0  0  0  0  0  0
> > GridNearAtomicSingleUpdateRequest  124  4.803000  124  0  0  0  0  0  0  0  0  0
> > GridNearAtomicUpdateResponse  120  0.448000  119  1  0  0  0  0  0  0  0  0
> > ---------------------------------------------------------------------------------------------------------
> > Total  1544912252  -  1542613019  2071099  57073  33387  61237  66024  9091  483  52  787
> >
> >
> > This will make it possible to understand the cluster load type and spot
> > problems with increased latency in message processing. In general, all
> > messages should be processed in under 30 ms.
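The bin boundaries used in the tables above amount to a trivial bucket lookup. A sketch of that classification (the class name is made up; bounds are taken from the example output):

```java
// Sketch of the time-bin bucketing behind the statistics tables.
public class TimeBins {
    static final long[] UPPER_BOUNDS_MS = {1, 10, 30, 50, 100, 250, 500, 750, 1000};

    // Returns 0 for "< 1 ms", 1 for "< 10 ms", ..., 9 for "> 1000 ms".
    static int binIndex(double millis) {
        for (int i = 0; i < UPPER_BOUNDS_MS.length; i++)
            if (millis < UPPER_BOUNDS_MS[i])
                return i;
        return UPPER_BOUNDS_MS.length;
    }
}
```

On the hot path the per-bin counters would be plain atomic longs indexed by this result, so recording a sample costs one comparison loop and one atomic increment.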
> >
> > 2. Log messages which wait in the queue or are processed for longer
> > than a configured threshold.
> >
> > Track the following: message enqueue time, time waiting in the queue
> > until processing starts, the head of the queue before waiting, the queue
> > size before waiting, total waiting time, total processing time, the
> > queue size after processing, and the message content. Example:
> >
> > Slow message: *enqueueTs*=2018-11-27 15:10:22.241, *waitTime*=0.048,
> > *procTime*=305.186, *messageId*=3a3064a9, *queueSzBefore*=0,
> > *headMessageId*=null,
> > *queueSzAfter*=0, *message*=GridNearTxFinishRequest [miniId=1,
> > mvccSnapshot=null, super=GridDistributedTxFinishRequest
> > [topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0],
> > futId=199a3155761-f379f312-ad4b-4181-acc5-0aacb3391f07, threadId=296,
> > commitVer=null, invalidate=false, commit=true, baseVer=null, txSize=0,
> > sys=false, plc=2, subjId=dda703a0-69ee-47cf-9b9a-bf3dc9309feb,
> > taskNameHash=0, flags=32, syncMode=FULL_SYNC, txState=IgniteTxStateImpl
> > [activeCacheIds=[644280847], recovery=false, mvccEnabled=false,
> > txMap=HashSet [IgniteTxEntry [key=KeyCacheObjectImpl [part=8, val=8,
> > hasValBytes=true], cacheId=644280847, txKey=IgniteTxKey
> > [key=KeyCacheObjectImpl [part=8, val=8, hasValBytes=true],
> > cacheId=644280847], val=[op=READ, val=null], prevVal=[op=NOOP, val=null],
> > oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
> > conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null,
> > filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=false,
> > entry=GridCacheMapEntry [key=KeyCacheObjectImpl [part=8, val=8,
> > hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0,
> > nodeOrder=0], hash=8, extras=GridCacheObsoleteEntryExtras
> > [obsoleteVer=GridCacheVersion [topVer=2147483647, order=0, nodeOrder=0]],
> > flags=2]GridDistributedCacheEntry [super=]GridDhtCacheEntry
> > [rdrs=ReaderId[] [], part=8, super=], prepared=0, locked=false,
> > nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
> > flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion
> > [topVer=154800617, order=1543320621588, nodeOrder=1]], IgniteTxEntry
> > [key=KeyCacheObjectImpl [part=11, val=11, hasValBytes=true],
> > cacheId=644280847, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=11,
> > val=11, hasValBytes=true], cacheId=644280847], val=[op=READ, val=null],
> > prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null],
> > entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null,
> > explicitVer=null, dhtVer=null, filters=CacheEntryPredicate[] [],
> > filtersPassed=false, filtersSet=false, entry=GridCacheMapEntry
> > [key=KeyCacheObjectImpl [part=11, val=11, hasValBytes=true], val=null,
> > ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=11,
> > extras=GridCacheObsoleteEntryExtras [obsoleteVer=GridCacheVersion
> > [topVer=2147483647, order=0, nodeOrder=0]],
> > flags=2]GridDistributedCacheEntry [super=]GridDhtCacheEntry
> > [rdrs=ReaderId[] [], part=11, super=], prepared=0, locked=false,
> > nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
> > flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion
> > [topVer=154800617, order=1543320621588, nodeOrder=1]]]],
> > super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=154800617,
> > order=1543320621587, nodeOrder=4], committedVers=null,
> rolledbackVers=null,
> > cnt=0, super=GridCacheIdMessage [cacheId=0]]]]
> >
> > This will make it possible to understand the type of activity that
> > produces latency issues and possibly identify the culprit code.
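The threshold check behind such a "Slow message" line could be as simple as the sketch below. The class is hypothetical; only the field names (waitTime, procTime, messageId, queueSzBefore, queueSzAfter) mirror the example log above:

```java
// Hypothetical sketch: emit a "Slow message" line when wait time plus
// processing time exceeds a configured threshold.
public class SlowMessageLogger {
    private final double thresholdMs;

    public SlowMessageLogger(double thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    // Returns the log line, or null if the message was fast enough.
    public String check(String messageId, double waitMs, double procMs,
                        int queueSzBefore, int queueSzAfter) {
        if (waitMs + procMs < thresholdMs)
            return null;
        return String.format(
            "Slow message: waitTime=%.3f, procTime=%.3f, messageId=%s, queueSzBefore=%d, queueSzAfter=%d",
            waitMs, procMs, messageId, queueSzBefore, queueSzAfter);
    }
}
```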
> >
> > 3. Track all synchronous waits while processing a message: checkpoint
> > locks, segment locks, page locks, cold page reads from disk, etc.
> >
> > 4. Allow extensions to collect additional statistics for specific
> > message types. For example, for *TxFinishRequest we could track the
> > total entries written to each cache and their size in bytes. For
> > GridNearLockRequest we could track the keys with the maximum lock queue
> > length to identify contention.
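Such an extension point could be as simple as the sketch below; the interface and registry names are made up, not part of the IGNITE-10418 PR:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension point: pluggable collectors for per-message-type stats.
interface MessageStatsExtension {
    boolean supports(Object msg);                      // e.g. msg instanceof GridNearLockRequest
    void onProcessed(Object msg, long procTimeNanos);  // record type-specific stats
}

public class MessageStatsRegistry {
    private final List<MessageStatsExtension> extensions = new ArrayList<>();

    public void register(MessageStatsExtension ext) {
        extensions.add(ext);
    }

    // Called once per processed message; dispatches to matching extensions.
    public void onProcessed(Object msg, long procTimeNanos) {
        for (MessageStatsExtension ext : extensions)
            if (ext.supports(msg))
                ext.onProcessed(msg, procTimeNanos);
    }
}
```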
> >
> > 5. Implement additional logging, JMX metrics, and a control.sh command
> > for printing the currently collected profiling data.
> >
> > 6. Implementation should be as lightweight as possible.
> >
> > Maybe something else.
> >
> > BTW, I have already started working on it and want to contribute my
> > work to the community. Ticket with a PR (not finished): [1]; see
> > CacheMessageStatsTest and CacheMessageStatsIndexingTest.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-10418
> >
> > Thoughts?
> >
> > --
> >
> > Best regards,
> > Alexei Scherbakov
> >
>
>
> --
> Best regards,
> Alexey
>
