cassandra-user mailing list archives

From Benjamin Black <b@b3k.us>
Subject Re: Node OOM Problems
Date Sun, 22 Aug 2010 07:04:59 GMT
I see no reason to make that assumption.  Cassandra currently has no
mechanism to alternate in that manner.  At the update rate you
require, you just need more disk io (bandwidth and iops).
Alternatively, you could use a bunch more, smaller nodes with the same
SATA RAID setup so they each take many fewer writes/sec and so can keep
up with compaction.
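
A rough back-of-envelope, assuming ~1 KB per write op and a 3-4x
compaction rewrite factor (both are assumptions; check them against
your own bytes/write op):

  ingest:      10,000 writes/sec x ~1 KB  ~=  10 MB/s of memtable flush traffic
  compaction:  the same data re-read and re-written 3-4x over its life,
               so roughly another 30-40 MB/s interleaved with the flushes

All of that lands on the same spindles and behaves far more like random
I/O than a clean sequential stream, which is why it is the data volume,
not the commitlog disk, that runs out of headroom.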

On Sun, Aug 22, 2010 at 12:00 AM, Wayne <wav100@gmail.com> wrote:
> Due to compaction being so expensive in terms of disk resources, does it
> make more sense to have 2 data volumes instead of one? We have 4 data disks
> in RAID 0; would it make more sense to have 2 x 2 disks in RAID 0? That way
> the reader and writer would, I assume, always be on different sets of spindles?
>
> On Sun, Aug 22, 2010 at 8:27 AM, Wayne <wav100@gmail.com> wrote:
>>
>> Thank you for the advice, I will try these settings. I am running defaults
>> right now. The disk subsystem is one SATA disk for the commitlog and 4 SATA
>> disks in RAID 0 for the data.
>>
>> From your email you are implying this hardware cannot handle this level
>> of sustained writes? That kind of breaks down the commodity server concept
>> for me. I have ALWAYS used 15k SAS disks (the fastest disks money could buy
>> until SSD) with a database. I have tried to throw out that mentality here,
>> but are you saying nothing has really changed? Spindles, spindles, spindles,
>> as fast as you can afford, is what I have always known... I guess that
>> applies here? Do I need to spend $10k per node instead of $3.5k to get
>> SUSTAINED 10k writes/sec per node?
>>
>>
>>
>> On Sat, Aug 21, 2010 at 11:03 PM, Benjamin Black <b@b3k.us> wrote:
>>>
>>> My guess is that you have (at least) 2 problems right now:
>>>
>>> You are writing 10k ops/sec to each node, but have default memtable
>>> flush settings.  This is resulting in memtable flushing every 30
>>> seconds (default ops flush setting is 300k).  You thus have a
>>> proliferation of tiny sstables and are seeing minor compactions
>>> triggered every couple of minutes.
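>>> (That 30 seconds is just the arithmetic: 300,000 ops / 10,000
>>> writes/sec per node, and every one of those flushes leaves another
>>> small sstable behind for the compactor.)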
>>>
>>> You have started a major compaction which is now competing with those
>>> near constant minor compactions for far too little I/O (3 SATA drives
>>> in RAID0, perhaps?).  Normally, this would result in a massive
>>> ballooning of your heap use as all sorts of activities (like memtable
>>> flushes) backed up, as well.
>>>
>>> I suggest you increase the memtable flush ops to at least 10 (million)
>>> if you are going to sustain that many writes/sec, along with an
>>> increase in the flush MB to match, based on your typical bytes/write
>>> op.  Long term, this level of write activity demands a lot faster
>>> storage (iops and bandwidth).
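>>>
>>> For reference, those knobs live in storage-conf.xml in 0.6 (element
>>> names from memory, and the numbers below are only illustrative; size
>>> them from your typical bytes/write op and the heap you can spare):
>>>
>>>   <MemtableOperationsInMillions>10</MemtableOperationsInMillions>
>>>   <MemtableThroughputInMB>512</MemtableThroughputInMB>
>>>
>>> Bigger memtables mean fewer, larger sstables per flush and much less
>>> frequent minor compaction, at the cost of more memory held before
>>> each flush.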
>>>
>>>
>>> b
>>> On Sat, Aug 21, 2010 at 2:18 AM, Wayne <wav100@gmail.com> wrote:
>>> > I am already running with those options. I thought maybe that is why
>>> > they never get completed, as they keep getting pushed down in priority?
>>> > I am getting timeouts now and then, but for the most part the cluster
>>> > keeps running. Is it normal/ok for the repair and compaction to take so
>>> > long? It has been over 12 hours since they were submitted.
>>> >
>>> > On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis <jbellis@gmail.com>
>>> > wrote:
>>> >>
>>> >> yes, the AES is the repair.
>>> >>
>>> >> if you are running linux, try adding the options to reduce compaction
>>> >> priority from
>>> >> http://wiki.apache.org/cassandra/PerformanceTuning
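>>> >>
>>> >> (i.e. adding something like the following to JVM_OPTS, so the
>>> >> compaction thread runs at a lower priority than the read/write
>>> >> stages; check the page for the exact flags:
>>> >>
>>> >>   -XX:+UseThreadPriorities
>>> >>   -XX:ThreadPriorityPolicy=42
>>> >>   -Dcassandra.compaction.priority=1 )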
>>> >>
>>> >> On Sat, Aug 21, 2010 at 3:17 AM, Wayne <wav100@gmail.com> wrote:
>>> >> > I could tell from munin that the disk utilization was getting
>>> >> > crazy high, but the strange thing is that it seemed to "stall".
>>> >> > The utilization went way down and everything seemed to flatten
>>> >> > out. Requests piled up and the node was doing nothing. It did not
>>> >> > "crash" but was left in a useless state. I do not have access to
>>> >> > the tpstats from when that occurred. Attached is the munin chart,
>>> >> > and you can see the flat line after Friday at noon.
>>> >> >
>>> >> > I have reduced the writers from 10 per node to 8 per node and
>>> >> > they seem to still be running, but I am afraid they are barely
>>> >> > hanging on. I ran nodetool repair after rebooting the failed node
>>> >> > and I do not think the repair ever completed. I also later ran
>>> >> > compact on each node; on some it finished but on some it did not.
>>> >> > Below is the tpstats currently for the node I had to restart. Is
>>> >> > the AE-SERVICE-STAGE the repair and compaction queued up? It
>>> >> > seems several nodes are not getting enough free cycles to keep
>>> >> > up. They are not timing out (30 sec timeout) for the most part,
>>> >> > but they are also not able to compact. Is this normal? Do I just
>>> >> > give it time? I am migrating 2-3 TB of data from MySQL, so the
>>> >> > load is constant and will be for days, and it seems even with
>>> >> > only 8 writer processes per node I am maxed out.
>>> >> >
>>> >> > Thanks for the advice. Any more pointers would be greatly
>>> >> > appreciated.
>>> >> >
>>> >> > Pool Name                    Active   Pending      Completed
>>> >> > FILEUTILS-DELETE-POOL             0         0           1868
>>> >> > STREAM-STAGE                      1         1              2
>>> >> > RESPONSE-STAGE                    0         2      769158645
>>> >> > ROW-READ-STAGE                    0         0         140942
>>> >> > LB-OPERATIONS                     0         0              0
>>> >> > MESSAGE-DESERIALIZER-POOL         1         0     1470221842
>>> >> > GMFD                              0         0         169712
>>> >> > LB-TARGET                         0         0              0
>>> >> > CONSISTENCY-MANAGER               0         0              0
>>> >> > ROW-MUTATION-STAGE                0         1      865124937
>>> >> > MESSAGE-STREAMING-POOL            0         0              6
>>> >> > LOAD-BALANCER-STAGE               0         0              0
>>> >> > FLUSH-SORTER-POOL                 0         0              0
>>> >> > MEMTABLE-POST-FLUSHER             0         0           8088
>>> >> > FLUSH-WRITER-POOL                 0         0           8088
>>> >> > AE-SERVICE-STAGE                  1        34             54
>>> >> > HINTED-HANDOFF-POOL               0         0              7
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Fri, Aug 20, 2010 at 11:56 PM, Bill de hÓra <bill@dehora.net>
>>> >> > wrote:
>>> >> >>
>>> >> >> On Fri, 2010-08-20 at 19:17 +0200, Wayne wrote:
>>> >> >>
>>> >> >> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
>>> >> >> > MessageDeserializationTask.java (line 47) dropping message
>>> >> >> > (1,078,378ms past timeout)
>>> >> >> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
>>> >> >> > MessageDeserializationTask.java (line 47) dropping message
>>> >> >> > (1,078,378ms past timeout)
>>> >> >>
>>> >> >> MESSAGE-DESERIALIZER-POOL usually backs up when other stages
>>> >> >> downstream are bogged down (e.g. here is Ben Black describing
>>> >> >> the symptom when the underlying cause is running out of disk
>>> >> >> bandwidth, well worth a watch:
>>> >> >> http://riptano.blip.tv/file/4012133/).
>>> >> >>
>>> >> >> Can you send all of nodetool tpstats?
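>>> >> >> (Something along the lines of "nodetool -host <node> tpstats"
>>> >> >> run against each node, using whatever JMX port your nodes are
>>> >> >> configured with.)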
>>> >> >>
>>> >> >> Bill
>>> >> >>
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Jonathan Ellis
>>> >> Project Chair, Apache Cassandra
>>> >> co-founder of Riptano, the source for professional Cassandra support
>>> >> http://riptano.com
>>> >
>>> >
>>
>
>
