OK, I have two test servers. They run RH and are pretty nice machines. I have
two problems with one of them and none with the other. The configuration is the
same on both, except that the seed and listen addresses are swapped. Nothing
fancy. RF=2.
All the info I could collect is also here, plus some more, such as the conf
(590 rows).
Problem no. 1, and the most annoying one.
I begin by emptying the data folder and the commitlog folder, and then I start
the servers.
I write data to both nodes, this time with CL.ONE, but it happens with CL.ALL
as well. The node that is troubling me does not write its in-memory data (the
memtable) to disk. As soon as it is time to do that, it starts to GC and keeps
doing so for a long time. It enqueues the flush but never writes, and it is
unresponsive during the GC storms. The other node works just as expected: it
takes the memtable and writes it to disk in a matter of seconds. This is not a
lot of data, and there are no reads.
Log from troubling node:
------------------------------------------
INFO 10:42:26,842 GC for ParNew: 808 ms, 106688440 reclaimed leaving 7273866048 used; max is 17388929024
INFO 10:42:31,613 GC for ParNew: 882 ms, 120705376 reclaimed leaving 7292752352 used; max is 17388929024
INFO 10:42:32,615 GC for ParNew: 621 ms, 108181664 reclaimed leaving 7324162368 used; max is 17388929024
INFO 10:42:35,468 GC for ParNew: 732 ms, 107646952 reclaimed leaving 7407855104 used; max is 17388929024
INFO 10:42:36,540 GC for ParNew: 556 ms, 106819200 reclaimed leaving 7440627584 used; max is 17388929024
INFO 10:42:38,348 GC for ParNew: 676 ms, 111891904 reclaimed leaving 7490450648 used; max is 17388929024
INFO 10:42:39,413 GC for ParNew: 768 ms, 110205856 reclaimed leaving 7519836472 used; max is 17388929024
INFO 10:42:40,671 GC for ParNew: 755 ms, 112034384 reclaimed leaving 7547393768 used; max is 17388929024
INFO 10:42:41,884 GC for ParNew: 834 ms, 108972528 reclaimed leaving 7578012920 used; max is 17388929024
INFO 10:42:43,102 GC for ParNew: 971 ms, 110778800 reclaimed leaving 7606825800 used; max is 17388929024
INFO 10:42:44,391 GC for ParNew: 1076 ms, 109996232 reclaimed leaving 7636421248 used; max is 17388929024
------------------------------------------
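For anyone who wants to put numbers on the GC storm, here is a minimal Python sketch that parses log lines of this shape and estimates how much of the wall-clock time went to ParNew pauses. The function names (`parse_gc_lines`, `summarize`) are made up for this example, and it assumes each line is logged shortly after the collection it reports, so the result is only a rough estimate.

```python
import re
from datetime import datetime

# GC lines copied from the log above, one event per line.
LOG = """\
INFO 10:42:26,842 GC for ParNew: 808 ms, 106688440 reclaimed leaving 7273866048 used; max is 17388929024
INFO 10:42:31,613 GC for ParNew: 882 ms, 120705376 reclaimed leaving 7292752352 used; max is 17388929024
INFO 10:42:32,615 GC for ParNew: 621 ms, 108181664 reclaimed leaving 7324162368 used; max is 17388929024
INFO 10:42:35,468 GC for ParNew: 732 ms, 107646952 reclaimed leaving 7407855104 used; max is 17388929024
INFO 10:42:36,540 GC for ParNew: 556 ms, 106819200 reclaimed leaving 7440627584 used; max is 17388929024
INFO 10:42:38,348 GC for ParNew: 676 ms, 111891904 reclaimed leaving 7490450648 used; max is 17388929024
INFO 10:42:39,413 GC for ParNew: 768 ms, 110205856 reclaimed leaving 7519836472 used; max is 17388929024
INFO 10:42:40,671 GC for ParNew: 755 ms, 112034384 reclaimed leaving 7547393768 used; max is 17388929024
INFO 10:42:41,884 GC for ParNew: 834 ms, 108972528 reclaimed leaving 7578012920 used; max is 17388929024
INFO 10:42:43,102 GC for ParNew: 971 ms, 110778800 reclaimed leaving 7606825800 used; max is 17388929024
INFO 10:42:44,391 GC for ParNew: 1076 ms, 109996232 reclaimed leaving 7636421248 used; max is 17388929024
"""

PATTERN = re.compile(
    r"INFO (\d{2}:\d{2}:\d{2},\d{3}) GC for (\w+): (\d+) ms, "
    r"(\d+) reclaimed leaving (\d+) used; max is (\d+)"
)


def parse_gc_lines(text):
    """Return one dict per GC line found in text, in log order."""
    events = []
    for match in PATTERN.finditer(text):
        stamp, collector, pause_ms, reclaimed, used, max_heap = match.groups()
        events.append({
            "time": datetime.strptime(stamp, "%H:%M:%S,%f"),
            "collector": collector,
            "pause_ms": int(pause_ms),
            "reclaimed": int(reclaimed),
            "used": int(used),
            "max": int(max_heap),
        })
    return events


def summarize(events):
    """Rough statistics over the window between the first and last event."""
    window_s = (events[-1]["time"] - events[0]["time"]).total_seconds()
    # Assumption: a line is logged after its collection, so the first
    # event's pause happened before the window started and is excluded.
    pause_s = sum(e["pause_ms"] for e in events[1:]) / 1000.0
    return {
        "events": len(events),
        "window_s": window_s,
        "pause_s": pause_s,
        "pause_fraction": pause_s / window_s,
        "heap_growth_bytes": events[-1]["used"] - events[0]["used"],
        "heap_used_fraction": events[-1]["used"] / events[-1]["max"],
    }


if __name__ == "__main__":
    for key, value in summarize(parse_gc_lines(LOG)).items():
        print(f"{key}: {value}")
```

Running the same script against the log of the healthy node would give a baseline to compare the pause fraction against.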
I had trouble copy-pasting all of the data, since I am running the server
remotely through PuTTY.
Ring
Address       Status     Load          Range                                      Ring
                                       142713423890871059377105093567732377974
x.x.x.211     Up         486 bytes     45911723912241754468195357739525604647     |<--|
x.x.x.209     Up         501.23 MB     142713423890871059377105093567732377974    |-->|
tpstats from the node that won't wake up from this state.
During the ParNew collections:
Pool Name                    Active   Pending      Completed
STREAM-STAGE                      0         0              0
RESPONSE-STAGE                    0         0        1003801
ROW-READ-STAGE                    0         0              0
LB-OPERATIONS                     0         0              0
MISCELLANEOUS-POOL                0         0              0
GMFD                              0         0           1047
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0              0
ROW-MUTATION-STAGE               32    183026        1035233
MESSAGE-STREAMING-POOL            0         0              0
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0              0
MEMTABLE-POST-FLUSHER             1         2              1
FLUSH-WRITER-POOL                 1         2              1
AE-SERVICE-STAGE                  0         0              0
HINTED-HANDOFF-POOL               0         0              2
After the ParNew collections are done:
Pool Name                    Active   Pending      Completed
STREAM-STAGE                      0         0              0
RESPONSE-STAGE                    0         0        1003801
ROW-READ-STAGE                    0         0              0
LB-OPERATIONS                     0         0              0
MISCELLANEOUS-POOL                0         0              0
GMFD                              0         0          17617
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0              0
ROW-MUTATION-STAGE                0         0        1218212
MESSAGE-STREAMING-POOL            0         0              0
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0              0
MEMTABLE-POST-FLUSHER             1         2              2
FLUSH-WRITER-POOL                 1         2              2
AE-SERVICE-STAGE                  0         0              0
HINTED-HANDOFF-POOL               1         1              3
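To make the two tpstats snapshots easier to compare, here is a small Python sketch that parses the non-zero rows from above and reports which pools made progress and which still have work queued. The helper names (`parse_tpstats`, `completed_delta`, `backlog`) are invented for this example and it only assumes the four-column layout shown in the output.

```python
# Non-zero rows from the snapshot taken during the ParNew collections.
DURING = """\
Pool Name Active Pending Completed
RESPONSE-STAGE 0 0 1003801
GMFD 0 0 1047
ROW-MUTATION-STAGE 32 183026 1035233
MEMTABLE-POST-FLUSHER 1 2 1
FLUSH-WRITER-POOL 1 2 1
HINTED-HANDOFF-POOL 0 0 2
"""

# The same pools after the ParNew collections were done.
AFTER = """\
Pool Name Active Pending Completed
RESPONSE-STAGE 0 0 1003801
GMFD 0 0 17617
ROW-MUTATION-STAGE 0 0 1218212
MEMTABLE-POST-FLUSHER 1 2 2
FLUSH-WRITER-POOL 1 2 2
HINTED-HANDOFF-POOL 1 1 3
"""


def parse_tpstats(text):
    """Map pool name -> (active, pending, completed); skips the header."""
    pools = {}
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
            continue  # header or malformed row
        name, active, pending, completed = parts
        pools[name] = (int(active), int(pending), int(completed))
    return pools


def completed_delta(before, after):
    """Completed-task increase per pool, only for pools that changed."""
    return {
        name: after[name][2] - before[name][2]
        for name in before
        if name in after and after[name][2] != before[name][2]
    }


def backlog(pools):
    """Pools that still have active or pending tasks."""
    return {
        name: (active, pending)
        for name, (active, pending, _completed) in pools.items()
        if active or pending
    }


if __name__ == "__main__":
    before, after = parse_tpstats(DURING), parse_tpstats(AFTER)
    print("completed since first snapshot:", completed_delta(before, after))
    print("still queued after GC:", backlog(after))
```

On these two snapshots the mutation stage drains its queue, while both flush pools show the same active and pending counts before and after.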
It is not that it is writing slowly; it is not writing at all, or at best
extremely slowly. I think what it does write comes from gossip, not from
connections to the node, and not in any real amount. It has nothing to do with
swapping or with the 16 GB the JVM is allowed to use. The data is much smaller
than that, and the problem appears when the first memtable write is supposed to
happen. The other node starts at the same moment, but it finishes and doesn't
loop.
If I restart the server, it writes the data from the commitlog to the data
folder and then stops working again as soon as it has to write new data from a
memtable.
The other problem with the same node is that if I use JNA, the kernel crashes
after an out-of-memory error, and the process uses nearly all of the 60 GB of
RAM even though I told the JVM to use at most 16 GB. It is unresponsive from
the start, and the whole server locks up, which makes it hard to collect
information, but we know it is a kernel crash caused by OOM.
If anyone has an idea about what is wrong, it would help a lot.
/Justus
AB SVENSKA SPEL
106 10 Stockholm
Sturegatan 11,
Sundbyberg
Växel +46 8 757 77 00
http://svenskaspel.se