incubator-cassandra-user mailing list archives

From Benjamin Black...@b3k.us>
Subject Re: Node OOM Problems
Date Sat, 21 Aug 2010 21:03:27 GMT
My guess is that you have (at least) 2 problems right now:

You are writing 10k ops/sec to each node, but have default memtable
flush settings.  This results in a memtable flush every 30 seconds
(the default ops flush setting is 300k).  You thus have a
proliferation of tiny sstables and are seeing minor compactions
triggered every couple of minutes.
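
The 30-second figure follows from dividing the ops flush threshold by the
per-node write rate. A minimal Python sketch of that arithmetic, using only
the numbers quoted above (the function name is made up for illustration):

```python
def flush_interval_seconds(ops_threshold: int, writes_per_sec: float) -> float:
    """Seconds between memtable flushes when the ops threshold is the trigger."""
    if writes_per_sec <= 0:
        raise ValueError("writes_per_sec must be positive")
    return ops_threshold / writes_per_sec


# 300k ops threshold at 10k writes/sec per node
print(flush_interval_seconds(300_000, 10_000))  # 30.0
```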

You have started a major compaction which is now competing with those
near constant minor compactions for far too little I/O (3 SATA drives
in RAID0, perhaps?).  Normally, this would result in a massive
ballooning of your heap use as all sorts of activities (like memtable
flushes) backed up, as well.

I suggest you increase the memtable flush ops to at least 10 (million)
if you are going to sustain that many writes/sec, along with an
increase in the flush MB to match, based on your typical bytes/write
op.  Long term, this level of write activity demands a lot faster
storage (iops and bandwidth).
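
As a rough sketch of how a matching flush MB value could be estimated:
multiply the ops threshold by the typical bytes per write op. The 200
bytes/write below is a placeholder assumption, not a number from this
thread, and the function names are hypothetical; substitute a figure
measured from your own data.

```python
import math


def seconds_between_flushes(ops_threshold: int, writes_per_sec: float) -> float:
    """How long it takes to reach the ops threshold at a steady write rate."""
    return ops_threshold / writes_per_sec


def flush_mb_for(ops_threshold: int, bytes_per_write: int) -> int:
    """Flush size in MB (rounded up) so the MB limit is reached at about
    the same time as the ops limit."""
    return math.ceil(ops_threshold * bytes_per_write / (1024 * 1024))


OPS_THRESHOLD = 10_000_000   # "at least 10 (million)" from the advice above
WRITES_PER_SEC = 10_000      # per-node write rate from the thread
BYTES_PER_WRITE = 200        # ASSUMPTION: placeholder, measure your own

print(seconds_between_flushes(OPS_THRESHOLD, WRITES_PER_SEC))  # 1000.0
print(flush_mb_for(OPS_THRESHOLD, BYTES_PER_WRITE))            # 1908
```

With these inputs a flush would happen roughly every 17 minutes instead of
every 30 seconds. Memtables of that size live on the heap until flushed, so
check the result against the heap you actually have.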


b
On Sat, Aug 21, 2010 at 2:18 AM, Wayne <wav100@gmail.com> wrote:
> I am already running with those options. I thought maybe that is why they
> never get completed, as they keep getting pushed down in priority? I am
> getting timeouts now and then but for the most part the cluster keeps
> running. Is it normal/ok for the repair and compaction to take so long? It
> has been over 12 hours since they were submitted.
>
> On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>>
>> yes, the AES is the repair.
>>
>> if you are running linux, try adding the options to reduce compaction
>> priority from
>> http://wiki.apache.org/cassandra/PerformanceTuning
>>
>> On Sat, Aug 21, 2010 at 3:17 AM, Wayne <wav100@gmail.com> wrote:
>> > I could tell from munin that the disk utilization was getting crazy high,
>> > but the strange thing is that it seemed to "stall". The utilization went
>> > way down and everything seemed to flatten out. Requests piled up and the
>> > node was doing nothing. It did not "crash" but was left in a useless
>> > state. I do not have access to the tpstats when that occurred. Attached
>> > is the munin chart, and you can see the flat line after Friday at noon.
>> >
>> > I have reduced the writers from 10 per node to 8 per node and they seem
>> > to be still running, but I am afraid they are barely hanging on. I ran
>> > nodetool repair after rebooting the failed node and I do not think the
>> > repair ever completed. I also later ran compact on each node; on some it
>> > finished but on some it did not. Below is the current tpstats for the
>> > node I had to restart. Is the AE-SERVICE-STAGE the repair and compaction
>> > queued up? It seems several nodes are not getting enough free cycles to
>> > keep up. They are not timing out (30 sec timeout) for the most part but
>> > they are also not able to compact. Is this normal? Do I just give it
>> > time? I am migrating 2-3 TB of data from MySQL so the load is constant
>> > and will be for days, and it seems even with only 8 writer processes per
>> > node I am maxed out.
>> >
>> > Thanks for the advice. Any more pointers would be greatly appreciated.
>> >
>> > Pool Name                    Active   Pending      Completed
>> > FILEUTILS-DELETE-POOL             0         0           1868
>> > STREAM-STAGE                      1         1              2
>> > RESPONSE-STAGE                    0         2      769158645
>> > ROW-READ-STAGE                    0         0         140942
>> > LB-OPERATIONS                     0         0              0
>> > MESSAGE-DESERIALIZER-POOL         1         0     1470221842
>> > GMFD                              0         0         169712
>> > LB-TARGET                         0         0              0
>> > CONSISTENCY-MANAGER               0         0              0
>> > ROW-MUTATION-STAGE                0         1      865124937
>> > MESSAGE-STREAMING-POOL            0         0              6
>> > LOAD-BALANCER-STAGE               0         0              0
>> > FLUSH-SORTER-POOL                 0         0              0
>> > MEMTABLE-POST-FLUSHER             0         0           8088
>> > FLUSH-WRITER-POOL                 0         0           8088
>> > AE-SERVICE-STAGE                  1        34             54
>> > HINTED-HANDOFF-POOL               0         0              7
>> >
>> >
>> >
>> > On Fri, Aug 20, 2010 at 11:56 PM, Bill de hÓra <bill@dehora.net> wrote:
>> >>
>> >> On Fri, 2010-08-20 at 19:17 +0200, Wayne wrote:
>> >>
>> >> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
>> >> > MessageDeserializationTask.java (line 47) dropping message
>> >> > (1,078,378ms past timeout)
>> >> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
>> >> > MessageDeserializationTask.java (line 47) dropping message
>> >> > (1,078,378ms past timeout)
>> >>
>> >> MESSAGE-DESERIALIZER-POOL usually backs up when other stages are bogged
>> >> down downstream (e.g. here's Ben Black describing the symptom when the
>> >> underlying cause is running out of disk bandwidth, well worth a watch:
>> >> http://riptano.blip.tv/file/4012133/).
>> >>
>> >> Can you send all of nodetool tpstats?
>> >>
>> >> Bill
>> >>
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>
>
