cassandra-user mailing list archives

From Wayne <wav...@gmail.com>
Subject Re: Node OOM Problems
Date Sat, 21 Aug 2010 09:18:00 GMT
I am already running with those options. I thought maybe that is why they
never complete, since they keep getting pushed down in priority. I am
getting timeouts now and then, but for the most part the cluster keeps
running. Is it normal/OK for repair and compaction to take this long? It
has been over 12 hours since they were submitted.
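
For reference, the options I mean are the compaction-priority settings from
the PerformanceTuning wiki page. As best I recall they went into our JVM
options (cassandra.in.sh in our setup; the file name may differ on other
installs), roughly like this:

    # Lower compaction thread priority on Linux (per the wiki page).
    # ThreadPriorityPolicy=42 lets a non-root JVM actually lower priorities.
    JVM_OPTS="$JVM_OPTS -XX:+UseThreadPriorities"
    JVM_OPTS="$JVM_OPTS -XX:ThreadPriorityPolicy=42"
    JVM_OPTS="$JVM_OPTS -Dcassandra.compaction.priority=1"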

On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis <jbellis@gmail.com> wrote:

> Yes, the AES is the repair.
>
> If you are running Linux, try adding the options to reduce compaction
> priority from
> http://wiki.apache.org/cassandra/PerformanceTuning
>
> On Sat, Aug 21, 2010 at 3:17 AM, Wayne <wav100@gmail.com> wrote:
> > I could tell from munin that the disk utilization was getting crazy
> > high, but the strange thing is that it seemed to "stall". The
> > utilization went way down and everything seemed to flatten out.
> > Requests piled up and the node was doing nothing. It did not "crash"
> > but was left in a useless state. I do not have access to the tpstats
> > from when that occurred. Attached is the munin chart, and you can see
> > the flat line after Friday at noon.
> >
> > I have reduced the writers from 10 to 8 per node and they seem to
> > still be running, but I am afraid they are barely hanging on. I ran
> > nodetool repair after rebooting the failed node and I do not think the
> > repair ever completed. I also later ran compact on each node; on some
> > it finished but on some it did not. Below is the current tpstats for
> > the node I had to restart. Is the AE-SERVICE-STAGE the repair and
> > compaction queued up? It seems several nodes are not getting enough
> > free cycles to keep up. They are not timing out (30 sec timeout) for
> > the most part, but they are also not able to compact. Is this normal?
> > Do I just give it time? I am migrating 2-3 TB of data from MySQL, so
> > the load is constant and will be for days, and it seems that even with
> > only 8 writer processes per node I am maxed out.
> >
> > Thanks for the advice. Any more pointers would be greatly appreciated.
> >
> > Pool Name                    Active   Pending      Completed
> > FILEUTILS-DELETE-POOL             0         0           1868
> > STREAM-STAGE                      1         1              2
> > RESPONSE-STAGE                    0         2      769158645
> > ROW-READ-STAGE                    0         0         140942
> > LB-OPERATIONS                     0         0              0
> > MESSAGE-DESERIALIZER-POOL         1         0     1470221842
> > GMFD                              0         0         169712
> > LB-TARGET                         0         0              0
> > CONSISTENCY-MANAGER               0         0              0
> > ROW-MUTATION-STAGE                0         1      865124937
> > MESSAGE-STREAMING-POOL            0         0              6
> > LOAD-BALANCER-STAGE               0         0              0
> > FLUSH-SORTER-POOL                 0         0              0
> > MEMTABLE-POST-FLUSHER             0         0           8088
> > FLUSH-WRITER-POOL                 0         0           8088
> > AE-SERVICE-STAGE                  1        34             54
> > HINTED-HANDOFF-POOL               0         0              7
> >
> >
> >
> > On Fri, Aug 20, 2010 at 11:56 PM, Bill de hÓra <bill@dehora.net> wrote:
> >>
> >> On Fri, 2010-08-20 at 19:17 +0200, Wayne wrote:
> >>
> >> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
> >> > MessageDeserializationTask.java (line 47) dropping message
> >> > (1,078,378ms past timeout)
> >> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
> >> > MessageDeserializationTask.java (line 47) dropping message
> >> > (1,078,378ms past timeout)
> >>
> >> MESSAGE-DESERIALIZER-POOL usually backs up when other stages downstream
> >> are bogged down (e.g. here's Ben Black describing the symptom when the
> >> underlying cause is running out of disk bandwidth; well worth a watch:
> >> http://riptano.blip.tv/file/4012133/).
> >>
> >> Can you send all of nodetool tpstats?
> >>
> >> Bill
> >>
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>
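
P.S. For clarity on what I actually ran above: repair, compact and tpstats
were plain nodetool invocations against each node, roughly as below (host
names are placeholders, and I am assuming the 0.6-style -host flag with the
default JMX port; newer nodetool versions take -h/--host instead):

    nodetool -host node1 tpstats    # thread pool stats pasted above
    nodetool -host node1 repair     # run after rebooting the failed node
    nodetool -host node1 compact    # major compaction, run later on each node

The disk saturation Bill mentions can also be double-checked outside munin
with something like "iostat -x 5" (from sysstat) while a node is under load.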
