ignite-issues mailing list archives

From "Ignite TC Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-6587) Ignite watchdog service
Date Wed, 27 Mar 2019 14:03:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802823#comment-16802823 ]

Ignite TC Bot commented on IGNITE-6587:
---------------------------------------

{panel:title=--> Run :: All: Possible Blockers|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1}
{color:#d04437}Platform .NET (Core Linux){color} [[tests 0 Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3395254]]

{color:#d04437}ZooKeeper (Discovery) 1{color} [[tests 0 TIMEOUT , Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3395256]]
* ZookeeperDiscoverySpiTest.testDisconnectOnServersLeft_3 (last started)

{color:#d04437}Client Nodes{color} [[tests 0 TIMEOUT , Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3395258]]
* IgniteClientRejoinTest.testClientsReconnect (last started)

{color:#d04437}Cache 3{color} [[tests 0 TIMEOUT , Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3395266]]
* IgniteCacheGroupsTest.testRestartsAndCacheCreateDestroy (last started)

{color:#d04437}Platform C++ (Linux Clang){color} [[tests 0 Exit Code , Failure on metric |https://ci.ignite.apache.org/viewLog.html?buildId=3395274]]

{color:#d04437}Hibernate 5.3{color} [[tests 0 Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3395282]]

{color:#d04437}Thin client: PHP{color} [[tests 0 Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3395280]]

{color:#d04437}Thin client: Node.js{color} [[tests 0 Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3395286]]

{color:#d04437}Thin client: Python{color} [[tests 0 Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3395290]]

{color:#d04437}Spring (Data){color} [[tests 0 Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3395294]]

{color:#d04437}Cache 1{color} [[tests 11|https://ci.ignite.apache.org/viewLog.html?buildId=3395262]]
* IgniteBinaryCacheTestSuite: DataStreamerClientReconnectAfterClusterRestartTest.testTwoClientsAllowOverwrite
- 0,0% fails in last 405 master runs.
* IgniteBinaryCacheTestSuite: DataStreamerClientReconnectAfterClusterRestartTest.testOneClientAllowOverwrite
- 0,0% fails in last 405 master runs.
* IgniteBinaryCacheTestSuite: DataStreamerClientReconnectAfterClusterRestartTest.testTwoClients
- 0,0% fails in last 405 master runs.
* IgniteBinaryCacheTestSuite: DataStreamerClientReconnectAfterClusterRestartTest.testOneClient
- 0,0% fails in last 405 master runs.

{color:#d04437}Queries 1{color} [[tests 6|https://ci.ignite.apache.org/viewLog.html?buildId=3395260]]
* IgniteBinaryCacheQueryTestSuite: SchemaExchangeSelfTest.testServerRestartWithNewTypes
- 0,0% fails in last 409 master runs.

{color:#d04437}PDS (Indexing){color} [[tests 4 Out Of Memory Error |https://ci.ignite.apache.org/viewLog.html?buildId=3395264]]
* IgnitePdsWithIndexingCoreTestSuite: IgniteLogicalRecoveryTest.testRecoveryOnJoinToDifferentBlt
- 0,0% fails in last 398 master runs.
* IgnitePdsWithIndexingCoreTestSuite: IgniteLogicalRecoveryTest.testRecoveryOnDynamicallyStartedCaches
- 0,0% fails in last 398 master runs.
* IgnitePdsWithIndexingCoreTestSuite: IgnitePdsThreadInterruptionTest.testInterruptsOnWALWrite
- 0,0% fails in last 398 master runs.
* IgniteLogicalRecoveryTest.testRecoveryOnDynamicallyStartedCaches (last started)

{color:#d04437}Queries 2{color} [[tests 14|https://ci.ignite.apache.org/viewLog.html?buildId=3395268]]
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentTransactionalReplicatedSelfTest.testClientReconnectWithCacheRestart
- 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentAtomicPartitionedSelfTest.testClientReconnectWithCacheRestart
- 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicIndexPartitionedTransactionalConcurrentSelfTest.testClientReconnectWithCacheRestart
- 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentTransactionalPartitionedSelfTest.testClientReconnectWithNonDynamicCacheRestart
- 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentTransactionalPartitionedSelfTest.testClientReconnectWithCacheRestart
- 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: IgniteCacheQueryNodeRestartSelfTest2.testRestarts
- 0,0% fails in last 0 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentAtomicReplicatedSelfTest.testClientReconnectWithNonDynamicCacheRestart
- 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicIndexReplicatedAtomicConcurrentSelfTest.testClientReconnectWithCacheRestart
- 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentAtomicReplicatedSelfTest.testClientReconnectWithCacheRestart
- 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicIndexPartitionedAtomicConcurrentSelfTest.testClientReconnectWithCacheRestart
- 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentTransactionalReplicatedSelfTest.testClientReconnectWithNonDynamicCacheRestart
- 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicIndexReplicatedTransactionalConcurrentSelfTest.testClientReconnectWithCacheRestart
- 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentAtomicPartitionedSelfTest.testClientReconnectWithNonDynamicCacheRestart
- 0,0% fails in last 414 master runs.

{color:#d04437}ZooKeeper (Discovery) 2{color} [[tests 5|https://ci.ignite.apache.org/viewLog.html?buildId=3395270]]
* ZookeeperDiscoverySpiTestSuite2: IgniteClientReconnectCacheTest.testReconnectClusterRestart
- 0,0% fails in last 406 master runs.
* ZookeeperDiscoverySpiTestSuite2: IgniteClientDataStructuresTest.testSequence
* ZookeeperDiscoverySpiTestSuite2: IgniteClientReconnectCacheTest.testReconnectCacheDestroyedAndCreated
- 0,0% fails in last 406 master runs.
* ZookeeperDiscoverySpiTestSuite2: GridCacheReplicatedNodeRestartSelfTest.testRestartWithTxEightNodesTwoBackups

{color:#d04437}Cache 2{color} [[tests 3|https://ci.ignite.apache.org/viewLog.html?buildId=3395272]]
* IgniteCacheTestSuite2: IgniteCacheClientNodeChangingTopologyTest.testPessimisticTx2
- 0,0% fails in last 405 master runs.
* IgniteCacheTestSuite2: IgniteCacheClientNodeChangingTopologyTest.testOptimisticTxPutAllMultinode
- 0,0% fails in last 405 master runs.
* IgniteCacheTestSuite2: IgniteClientCacheStartFailoverTest.testClientStartLastServerFailsTx
- 0,0% fails in last 405 master runs.

{color:#d04437}Continuous Query 1{color} [[tests 2|https://ci.ignite.apache.org/viewLog.html?buildId=3395278]]
* IgniteCacheQuerySelfTestSuite3: CacheContinuousWithTransformerReplicatedSelfTest.testContinuousWithTransformerAndRegularListenerAsync
- 0,0% fails in last 413 master runs.
* IgniteCacheQuerySelfTestSuite3: CacheContinuousQueryConcurrentPartitionUpdateTest.testConcurrentUpdatesAndQueryStartTx
- 0,0% fails in last 413 master runs.

{color:#d04437}Web Sessions{color} [[tests 4|https://ci.ignite.apache.org/viewLog.html?buildId=3395288]]
* IgniteWebSessionSelfTestSuite: WebSessionSelfTest.testClientReconnectRequest
- 0,0% fails in last 412 master runs.

{color:#d04437}Basic 3{color} [[tests 1|https://ci.ignite.apache.org/viewLog.html?buildId=3395292]]
* IgniteBasicWithPersistenceTestSuite: PluginNodeValidationTest.testValidationException

{color:#d04437}Platform C++ (Win x64 | Release){color} [[tests 5 Failure on metric , BuildFailureOnMessage |https://ci.ignite.apache.org/viewLog.html?buildId=3395276]]
* IgniteOdbcTest: QueriesTestSuite: TestManyCursorsSelectMerge2
- 0,6% fails in last 824 master runs.
* IgniteOdbcTest: QueriesTestSuite: TestManyCursorsTwoSelects2
- 0,6% fails in last 824 master runs.
* IgniteOdbcTest: QueriesTestSuite: TestInsertBatchSelect2049
- 0,6% fails in last 824 master runs.
* IgniteOdbcTest: QueriesTestSuite: TestInsertBatchSelect100
- 0,6% fails in last 824 master runs.
* IgniteOdbcTest: QueriesTestSuite: TestNotFullInsertBatchSelect1500
- 0,6% fails in last 824 master runs.

{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=3372451&buildTypeId=IgniteTests24Java8_RunAll]

> Ignite watchdog service
> -----------------------
>
>                 Key: IGNITE-6587
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6587
>             Project: Ignite
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 2.2
>            Reporter: Alexey Goncharuk
>            Assignee: Andrey Kuznetsov
>            Priority: Major
>              Labels: IEP-5
>             Fix For: 2.7
>
>         Attachments: watchdog.sh
>
>
> As described in [1], each Ignite node has a number of system-critical threads. We should implement a periodic check that calls the failure handler when one of the following conditions is detected:
> * Critical thread is not alive anymore.
> * Critical thread 'hangs' for a long time, e.g. while executing a task extracted from the task queue.
> In case of a failure condition, call stacks of all threads should be logged before invoking the failure handler.
> Actual list of system-critical threads can be found at [1].
> Implementations based on a separate diagnostic thread seem fragile, because this thread becomes a vulnerable point with respect to thread termination and CPU resource starvation. So we are to use a self-monitoring approach: critical threads themselves should monitor each other.
> Currently we have the {{o.a.i.internal.worker.WorkersRegistry}} facility, which is the best fit to store and track system-critical threads. All of them should be refactored to be {{GridWorker}}s and added to {{WorkersRegistry}}. Each worker should periodically choose some subset of peer workers and check whether:
> * All of them are alive.
> * All of them are actively running.
> A 'heartbeat' timestamp must be added to each worker in order to implement the latter check. Additionally, infinite queue polls, waits on monitors, and thread parks should be refactored to their timed equivalents in system-critical threads.
> Monitoring parameters (enable/disable, check interval, thread 'hang' threshold, etc.) are to be set via system properties.
> [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
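
The self-monitoring scheme in the quoted description could look roughly like the Java sketch below. It is only an illustration under stated assumptions: `WatchdogSketch`, `Worker`, `Status`, and the system property `sketch.watchdog.hangThresholdMs` are hypothetical names, and the code does not use Ignite's real `GridWorker`, `WorkersRegistry`, or failure handler types. It shows a heartbeat timestamp, a timed queue poll in place of an unbounded wait, a round-robin check of one peer per loop iteration, and a dump of all thread stacks before the failure callback is invoked.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

/** Illustrative only: none of these types are Ignite's real classes. */
final class WatchdogSketch {
    enum Status { OK, DEAD, HUNG }

    /** Hypothetical property name; the issue does not specify one. */
    static final long HANG_THRESHOLD_MS =
        Long.getLong("sketch.watchdog.hangThresholdMs", 10_000L);

    /** A worker is DEAD if its thread is not alive, HUNG if its heartbeat is stale. */
    static Status check(Worker w, long nowMs, long hangThresholdMs) {
        if (!w.thread.isAlive())
            return Status.DEAD;
        return nowMs - w.heartbeatTs > hangThresholdMs ? Status.HUNG : Status.OK;
    }

    /** Logs call stacks of all threads, as requested before invoking the failure handler. */
    static void dumpAllThreads() {
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            System.err.println("Thread " + e.getKey().getName());
            for (StackTraceElement el : e.getValue())
                System.err.println("    at " + el);
        }
    }

    /** Hypothetical stand-in for a system-critical worker. */
    static final class Worker implements Runnable {
        final String name;
        final List<Worker> registry;                       // shared, thread-safe list of peers
        final BiConsumer<Worker, Status> failureHandler;   // assumed callback shape
        final BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<>();
        final Thread thread;
        volatile long heartbeatTs = System.currentTimeMillis();
        volatile boolean stopped;
        private int nextPeer;                              // touched only by this worker's thread

        Worker(String name, List<Worker> registry, BiConsumer<Worker, Status> failureHandler) {
            this.name = name;
            this.registry = registry;
            this.failureHandler = failureHandler;
            this.thread = new Thread(this, name);
        }

        /** Starts the thread first, then registers, so peers never see an unstarted worker. */
        void start() {
            thread.start();
            registry.add(this);
        }

        void stopAndJoin() {
            stopped = true;
            try {
                thread.join();
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        @Override public void run() {
            try {
                while (!stopped) {
                    heartbeatTs = System.currentTimeMillis();
                    // Timed poll instead of an unbounded take(): the heartbeat advances while idle.
                    Runnable task = tasks.poll(100, TimeUnit.MILLISECONDS);
                    if (task != null)
                        task.run();
                    checkOnePeer();
                }
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        /** Checks a single peer per iteration; a stopped peer is also reported as DEAD here. */
        private void checkOnePeer() {
            int size = registry.size();
            if (size == 0)
                return;
            nextPeer = (nextPeer + 1) % size;
            Worker peer = registry.get(nextPeer);
            if (peer == this)
                return;
            Status s = check(peer, System.currentTimeMillis(), HANG_THRESHOLD_MS);
            if (s != Status.OK) {
                dumpAllThreads();
                failureHandler.accept(peer, s);
            }
        }
    }
}
```

The registry passed to each `Worker` is assumed to be a thread-safe list such as `java.util.concurrent.CopyOnWriteArrayList`. The sketch reports a failed peer on every iteration that observes it; a real implementation would presumably deduplicate reports and deregister workers that stop intentionally.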



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
