I have two test servers. They run RH and are pretty capable machines. One of them has two problems and the other has none. They have the same configuration, except that the seed and listen addresses are swapped between them. Nothing fancy. RF=2.

 

All the information I could gather is also available here, plus some more such as the configuration (590 lines):

http://pastie.org/1131106

 

Problem 1, the most annoying one:

I start by emptying the data folder and the commitlog folder, and then I start the servers.

 

I write data to both nodes, this time at CL.ONE, but it happens at CL.ALL as well. The troubled node does not write its in-memory data to disk. As soon as it is time to do that, it starts to GC and keeps doing so for a long time; it then enqueues the flush but does not write. The node is unresponsive during the GC storms. The other node works just as expected: it takes what is in memory and writes it to disk in a matter of seconds. This is not a lot of data, and there are no reads.

 

Log from the troubled node:

------------------------------------------

 INFO 10:42:26,842 GC for ParNew: 808 ms, 106688440 reclaimed leaving 7273866048 used; max is 17388929024

 INFO 10:42:31,613 GC for ParNew: 882 ms, 120705376 reclaimed leaving 7292752352 used; max is 17388929024

 INFO 10:42:32,615 GC for ParNew: 621 ms, 108181664 reclaimed leaving 7324162368 used; max is 17388929024

 INFO 10:42:35,468 GC for ParNew: 732 ms, 107646952 reclaimed leaving 7407855104 used; max is 17388929024

 INFO 10:42:36,540 GC for ParNew: 556 ms, 106819200 reclaimed leaving 7440627584 used; max is 17388929024

 INFO 10:42:38,348 GC for ParNew: 676 ms, 111891904 reclaimed leaving 7490450648 used; max is 17388929024

 INFO 10:42:39,413 GC for ParNew: 768 ms, 110205856 reclaimed leaving 7519836472 used; max is 17388929024

 INFO 10:42:40,671 GC for ParNew: 755 ms, 112034384 reclaimed leaving 7547393768 used; max is 17388929024

 INFO 10:42:41,884 GC for ParNew: 834 ms, 108972528 reclaimed leaving 7578012920 used; max is 17388929024

 INFO 10:42:43,102 GC for ParNew: 971 ms, 110778800 reclaimed leaving 7606825800 used; max is 17388929024

 INFO 10:42:44,391 GC for ParNew: 1076 ms, 109996232 reclaimed leaving 7636421248 used; max is 17388929024

 ------------------------------------------
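A rough way to put numbers on the log above is sketched below in Python. This is only an illustration, not part of the setup described here: the regular expression is an assumption based on the shape of the eleven lines shown, and the names `parse_gc_line` and `summarize` are made up for the example.

```python
import re

# Lines copied from the log above. The regex below is an assumption
# based only on the shape of these lines.
LOG = """\
 INFO 10:42:26,842 GC for ParNew: 808 ms, 106688440 reclaimed leaving 7273866048 used; max is 17388929024
 INFO 10:42:31,613 GC for ParNew: 882 ms, 120705376 reclaimed leaving 7292752352 used; max is 17388929024
 INFO 10:42:32,615 GC for ParNew: 621 ms, 108181664 reclaimed leaving 7324162368 used; max is 17388929024
 INFO 10:42:35,468 GC for ParNew: 732 ms, 107646952 reclaimed leaving 7407855104 used; max is 17388929024
 INFO 10:42:36,540 GC for ParNew: 556 ms, 106819200 reclaimed leaving 7440627584 used; max is 17388929024
 INFO 10:42:38,348 GC for ParNew: 676 ms, 111891904 reclaimed leaving 7490450648 used; max is 17388929024
 INFO 10:42:39,413 GC for ParNew: 768 ms, 110205856 reclaimed leaving 7519836472 used; max is 17388929024
 INFO 10:42:40,671 GC for ParNew: 755 ms, 112034384 reclaimed leaving 7547393768 used; max is 17388929024
 INFO 10:42:41,884 GC for ParNew: 834 ms, 108972528 reclaimed leaving 7578012920 used; max is 17388929024
 INFO 10:42:43,102 GC for ParNew: 971 ms, 110778800 reclaimed leaving 7606825800 used; max is 17388929024
 INFO 10:42:44,391 GC for ParNew: 1076 ms, 109996232 reclaimed leaving 7636421248 used; max is 17388929024
"""

GC_LINE = re.compile(
    r"INFO (\d+):(\d+):(\d+),(\d+) GC for (\w+): (\d+) ms, "
    r"(\d+) reclaimed leaving (\d+) used; max is (\d+)"
)


def parse_gc_line(line):
    """Return one GC event as a dict, or None if the line does not match."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    hours, minutes, seconds, millis = (int(m.group(i)) for i in range(1, 5))
    return {
        "t_ms": ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis,
        "pause_ms": int(m.group(6)),
        "used": int(m.group(8)),
        "max": int(m.group(9)),
    }


def summarize(text):
    """Aggregate pause time and heap growth over all matching lines."""
    events = [e for e in map(parse_gc_line, text.splitlines()) if e]
    return {
        "events": len(events),
        "span_ms": events[-1]["t_ms"] - events[0]["t_ms"],
        "pause_ms_total": sum(e["pause_ms"] for e in events),
        "heap_growth_bytes": events[-1]["used"] - events[0]["used"],
        "max_heap_gib": events[-1]["max"] / 2**30,
    }


summary = summarize(LOG)
for key, value in summary.items():
    print(key, value)
```

On the lines above this counts 11 ParNew collections in a window of about 17.5 seconds, with 8679 ms of reported pause time in total, while the used heap still grows by roughly 360 MB.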

I had trouble copying and pasting all of the output, because I run the server remotely through PuTTY.

 

Ring

Address       Status     Load          Range                                      Ring

                                       142713423890871059377105093567732377974

x.x.x.211 Up         486 bytes     45911723912241754468195357739525604647     |<--|

x.x.x.209 Up         501.23 MB     142713423890871059377105093567732377974    |-->|

 

tpstats from the node that won't wake up from this state.

 

While the ParNew collections are running:

 

Pool Name                    Active   Pending      Completed

STREAM-STAGE                      0         0              0

RESPONSE-STAGE                    0         0        1003801

ROW-READ-STAGE                    0         0              0

LB-OPERATIONS                     0         0              0

MISCELLANEOUS-POOL                0         0              0

GMFD                              0         0           1047

LB-TARGET                         0         0              0

CONSISTENCY-MANAGER               0         0              0

ROW-MUTATION-STAGE               32    183026        1035233

MESSAGE-STREAMING-POOL            0         0              0

LOAD-BALANCER-STAGE               0         0              0

FLUSH-SORTER-POOL                 0         0              0

MEMTABLE-POST-FLUSHER             1         2              1

FLUSH-WRITER-POOL                 1         2              1

AE-SERVICE-STAGE                  0         0              0

HINTED-HANDOFF-POOL               0         0              2

 

When the ParNew collections are done:

 

Pool Name                    Active   Pending      Completed

STREAM-STAGE                      0         0              0

RESPONSE-STAGE                    0         0        1003801

ROW-READ-STAGE                    0         0              0

LB-OPERATIONS                     0         0              0

MISCELLANEOUS-POOL                0         0              0

GMFD                              0         0          17617

LB-TARGET                         0         0              0

CONSISTENCY-MANAGER               0         0              0

ROW-MUTATION-STAGE                0         0        1218212

MESSAGE-STREAMING-POOL            0         0              0

LOAD-BALANCER-STAGE               0         0              0

FLUSH-SORTER-POOL                 0         0              0

MEMTABLE-POST-FLUSHER             1         2              2

FLUSH-WRITER-POOL                 1         2              2

AE-SERVICE-STAGE                  0         0              0

HINTED-HANDOFF-POOL               1         1              3
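To compare the two tpstats snapshots above without reading them side by side, a small Python sketch like the following can help. It is only an illustration: it assumes the four-column layout shown above, uses a subset of the rows copied from this post, and the helper names `parse_tpstats` and `compare` are hypothetical.

```python
# A subset of the rows from the two snapshots above.
BEFORE = """\
Pool Name                    Active   Pending      Completed
GMFD                              0         0           1047
ROW-MUTATION-STAGE               32    183026        1035233
MEMTABLE-POST-FLUSHER             1         2              1
FLUSH-WRITER-POOL                 1         2              1
HINTED-HANDOFF-POOL               0         0              2
"""

AFTER = """\
Pool Name                    Active   Pending      Completed
GMFD                              0         0          17617
ROW-MUTATION-STAGE                0         0        1218212
MEMTABLE-POST-FLUSHER             1         2              2
FLUSH-WRITER-POOL                 1         2              2
HINTED-HANDOFF-POOL               1         1              3
"""


def parse_tpstats(text):
    """Map pool name -> (active, pending, completed); skip header and blanks."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly one name and three integer columns.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
            continue
        stats[parts[0]] = tuple(int(p) for p in parts[1:])
    return stats


def compare(before, after):
    """Per pool: tasks completed between snapshots and what is still queued."""
    report = {}
    for name, (active, pending, completed) in after.items():
        if name not in before:
            continue
        report[name] = {
            "completed_delta": completed - before[name][2],
            "active": active,
            "pending": pending,
        }
    return report


report = compare(parse_tpstats(BEFORE), parse_tpstats(AFTER))
still_pending = sorted(n for n, r in report.items() if r["pending"] > 0)

for name, row in report.items():
    print(name, row)
print("still pending:", still_pending)
```

For these rows, ROW-MUTATION-STAGE completes 182979 more tasks and drains its queue, while MEMTABLE-POST-FLUSHER and FLUSH-WRITER-POOL each complete a single task and still show 1 active and 2 pending.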

 

The problem is not that the node is writing slowly: it is not writing at all, ever, or only extremely slowly. I think what it does write comes from gossip, not from connections to the node, and it is not any real amount. It has nothing to do with swapping or with the 16 GB the JVM is allowed to use. The data is much smaller than that, and the problem appears when the first memtable write is supposed to happen. The other node starts at the same moment, but it finishes and doesn’t loop. If I restart the server, it writes the data from the commitlog to the data folder and then stops working again as soon as it has to write new data from a memtable.

 

The other problem with the same node is that if I use JNA, the kernel crashes after an out-of-memory error. The process uses almost all of the 60 GB of RAM, although I told the JVM to use at most 16 GB. The node is unresponsive from the start, and the whole server locks up before the crash, which makes it hard to collect information, but we know it is a kernel crash caused by OOM.

 

If anyone has an idea about what is wrong, it would help a lot.

/Justus

 

AB SVENSKA SPEL
106 10 Stockholm
Sturegatan 11, Sundbyberg
Växel +46 8 757 77 00
http://svenskaspel.se