cassandra-user mailing list archives

From Wayne <wav...@gmail.com>
Subject Re: Node OOM Problems
Date Sat, 21 Aug 2010 08:17:39 GMT
I could tell from munin that the disk utilization was getting crazy high,
but the strange thing is that it seemed to "stall". The utilization went way
down and everything seemed to flatten out. Requests piled up and the node
was doing nothing. It did not "crash" but was left in a useless state. I do
not have the tpstats from when that occurred. Attached is the munin
chart, and you can see the flat line after Friday at noon.

I have reduced the writers from 10 per node to 8 per node and they seem to be
still running, but I am afraid they are barely hanging on. I ran nodetool
repair after rebooting the failed node and I do not think the repair ever
completed. I also later ran compact on each node; on some it finished but
on others it did not. Below is the current tpstats for the node I had to
restart. Is the AE-SERVICE-STAGE the repair and compaction queued up? It
seems several nodes are not getting enough free cycles to keep up. They are
not timing out (30 sec timeout) for the most part but they are also not able
to compact. Is this normal? Do I just give it time? I am migrating 2-3 TB of
data from MySQL so the load is constant and will be for days and it seems
even with only 8 writer processes per node I am maxed out.

Thanks for the advice. Any more pointers would be greatly appreciated.

Pool Name                    Active   Pending      Completed
FILEUTILS-DELETE-POOL             0         0           1868
STREAM-STAGE                      1         1              2
RESPONSE-STAGE                    0         2      769158645
ROW-READ-STAGE                    0         0         140942
LB-OPERATIONS                     0         0              0
MESSAGE-DESERIALIZER-POOL         1         0     1470221842
GMFD                              0         0         169712
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0              0
ROW-MUTATION-STAGE                0         1      865124937
MESSAGE-STREAMING-POOL            0         0              6
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0              0
MEMTABLE-POST-FLUSHER             0         0           8088
FLUSH-WRITER-POOL                 0         0           8088
AE-SERVICE-STAGE                  1        34             54
HINTED-HANDOFF-POOL               0         0              7
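
A minimal sketch, assuming tpstats output in the four-column layout shown above has been captured as text, of how the pools with a backlog could be picked out when comparing several nodes. The helper names (parse_tpstats, backed_up) and the threshold of 10 pending tasks are hypothetical choices for illustration; they are not part of nodetool.

```python
from typing import List, NamedTuple


class PoolStats(NamedTuple):
    name: str
    active: int
    pending: int
    completed: int


# A few rows copied from the tpstats output above.
SAMPLE = """\
Pool Name                    Active   Pending      Completed
STREAM-STAGE                      1         1              2
RESPONSE-STAGE                    0         2      769158645
ROW-MUTATION-STAGE                0         1      865124937
AE-SERVICE-STAGE                  1        34             54
"""


def parse_tpstats(text: str) -> List[PoolStats]:
    """Parse 'name active pending completed' rows, skipping anything else."""
    pools = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly one name token followed by three integers;
        # the header row and blank lines do not match and are skipped.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
            continue
        name, active, pending, completed = parts
        pools.append(PoolStats(name, int(active), int(pending), int(completed)))
    return pools


def backed_up(pools: List[PoolStats], min_pending: int = 10) -> List[PoolStats]:
    """Return pools whose pending count is at or above min_pending (assumed threshold)."""
    return [p for p in pools if p.pending >= min_pending]


if __name__ == "__main__":
    for pool in backed_up(parse_tpstats(SAMPLE)):
        print(f"{pool.name}: {pool.pending} pending, {pool.active} active")
```

Run against the rows above, only AE-SERVICE-STAGE crosses the assumed threshold; the right cutoff for a real cluster is a judgment call.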



On Fri, Aug 20, 2010 at 11:56 PM, Bill de hÓra <bill@dehora.net> wrote:

> On Fri, 2010-08-20 at 19:17 +0200, Wayne wrote:
>
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
> > MessageDeserializationTask.java (line 47) dropping message
> > (1,078,378ms past timeout)
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
> > MessageDeserializationTask.java (line 47) dropping message
> > (1,078,378ms past timeout)
>
> MESSAGE-DESERIALIZER-POOL usually backs up when other stages are bogged
> downstream, (eg here's Ben Black describing the symptom when the
> underlying cause is running out of disk bandwidth, well worth a watch
> http://riptano.blip.tv/file/4012133/).
>
> Can you send all of nodetool tpstats?
>
> Bill
>
>
