Thank you for your reply. I will check the thread pools. According to the thread pool descriptions in the Ignite docs, the problem may be in the Striped Pool.

In my case I have a lot of writes and a small number of reads.
If writes and reads are processed through the same queue, I will have this problem all the time.

If the problem is in the striped pool, is there any way to split the processing of reads and writes into separate thread pools?

Evgeny Pryakhin


On 9 July 2019, at 18:13, Ilya Kasnacheev <ilya.kasnacheev@gmail.com> wrote:

Hello!

I think you should collect thread dumps from all nodes to locate the bottleneck. Then, maybe you need to adjust thread pool sizes.
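As a rough illustration of what to look for in those thread dumps, here is a minimal sketch that counts the threads of one JVM per pool and state, using only the standard JDK `ThreadMXBean`. It has to run inside the node's JVM (for example in an embedded node); for a standalone node the stock `jstack <pid>` tool gives the raw dump instead. The class name `PoolStateSummary` and the helper `poolOf` are hypothetical, and the assumption that Ignite striped pool threads are named like `sys-stripe-3-#4%grid%` should be checked against a real dump from your version.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Map;
import java.util.TreeMap;

public class PoolStateSummary {
    /**
     * Reduces a thread name to a pool name. Assumes (not verified here) names such as
     * "sys-stripe-3-#4%grid%", which this turns into "sys-stripe".
     */
    public static String poolOf(String threadName) {
        // Cut the per-thread counter and anything after it, e.g. "-#4%grid%".
        int counterPos = threadName.indexOf("-#");
        String base = counterPos >= 0 ? threadName.substring(0, counterPos) : threadName;
        // Drop a trailing "-<digits>" such as a stripe index.
        return base.replaceAll("-\\d+$", "");
    }

    /** Counts live threads of the current JVM per "pool STATE" key. */
    public static Map<String, Integer> summarize() {
        Map<String, Integer> counts = new TreeMap<>();
        // false, false: skip lock details, only names and states are needed here.
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // defensive: a thread may have exited in the meantime
            }
            String key = poolOf(info.getThreadName()) + " " + info.getThreadState();
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        summarize().forEach((key, count) -> System.out.println(key + ": " + count));
    }
}
```

If most threads of one pool are busy (or stuck waiting on disk I/O) in several consecutive dumps while the others are idle, that pool is a likely candidate for the bottleneck.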

My idea here is that some thread pool (stripe, probably) gets full with persistent cache writing operations, and reads have to wait in queue.
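To make the idea concrete, here is a toy model of a striped pool built from plain JDK executors. It is not Ignite code and all names in it (`StripeBlockingDemo`, `ToyStripedPool`) are hypothetical. It assumes a simplified design: one single-threaded queue per stripe, with the stripe chosen from the partition number. Under that assumption a fast read that lands on the same stripe as a slow write has to wait, while a read on another stripe does not.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StripeBlockingDemo {
    /** Toy striped pool: one single-threaded executor per stripe. */
    public static final class ToyStripedPool {
        private final ExecutorService[] stripes;

        public ToyStripedPool(int stripeCount) {
            stripes = new ExecutorService[stripeCount];
            for (int i = 0; i < stripeCount; i++) {
                stripes[i] = Executors.newSingleThreadExecutor();
            }
        }

        /** Routes the task to a stripe by partition number (assumed mapping). */
        public Future<String> submit(int partition, Callable<String> task) {
            return stripes[Math.floorMod(partition, stripes.length)].submit(task);
        }

        public void shutdown() {
            for (ExecutorService stripe : stripes) {
                stripe.shutdown();
            }
        }
    }

    /** Returns true if the future completes within the given timeout. */
    public static boolean completesWithin(Future<?> future, long millis) throws Exception {
        try {
            future.get(millis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        }
    }

    /** Returns {otherStripeReadIsFast, sameStripeReadIsBlocked, sameStripeReadFinishesLater}. */
    public static boolean[] run() throws Exception {
        ToyStripedPool pool = new ToyStripedPool(4);
        CountDownLatch diskDone = new CountDownLatch(1); // stands in for a slow HDD write
        try {
            // The "persistent write" occupies stripe 0 until the simulated disk finishes.
            pool.submit(0, () -> { diskDone.await(); return "written"; });
            Future<String> readSameStripe = pool.submit(4, () -> "value");  // 4 mod 4 = stripe 0
            Future<String> readOtherStripe = pool.submit(1, () -> "value"); // stripe 1

            boolean otherFast = completesWithin(readOtherStripe, 2000);
            boolean sameBlocked = !completesWithin(readSameStripe, 200);
            diskDone.countDown(); // the write completes, the stripe is free again
            boolean sameLater = completesWithin(readSameStripe, 2000);
            return new boolean[] {otherFast, sameBlocked, sameLater};
        } finally {
            diskDone.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        boolean[] result = run();
        System.out.println("read on other stripe completes quickly: " + result[0]);
        System.out.println("read on busy stripe is blocked:          " + result[1]);
        System.out.println("read on busy stripe completes afterwards: " + result[2]);
    }
}
```

This only shows that the mechanism is possible; whether it is what actually happens in the cluster still has to be confirmed from the thread dumps.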

Regards,
--
Ilya Kasnacheev


Tue, 9 Jul 2019 at 18:08, Evgeny Pryakhin <breath1988@gmail.com>:
Hello. I need some help or advice on my problem.

Preface: 
- Ignite 2.5 (upgrading to a newer version is not an option)
- cluster of 8 servers, each with 4 CPUs, 64GB RAM and HDD (not SSD). The operating system was tuned according to the performance guide.
- two memory regions configured: one in-memory only (500MB) and one with persistence enabled (about 40GB memory). 
- one cache in the in-memory region (about 300k records), backups: 3. Write mode: PRIMARY_SYNC.
- one cache in the region with persistence (about 50M records), backups: 3. Write mode: PRIMARY_SYNC.
- Ignite Thin Client as a driver. 

Scenario: 
- I have batch writes to the first (in-memory) cache, about 500/sec, continuously.
- I have a lot of reads from the first (in-memory) cache, about 3k/sec, continuously.
- I have a lot of batch writes to the second (persistent) cache. Batch size is about 1k records. Continuously.

The Problem: 
- when batch writes to the second (persistent) cache are disabled, reads from the first cache work well, with low latency (<1ms).
- when batch writes to the persistent cache are turned on, reads from the first cache become very slow, about 200-300ms.

I have no idea how to even start investigating this problem. Maybe I can check some cluster metrics, or system metrics on the hardware servers, to find the right way to solve it. Do you have any ideas about this?