incubator-cassandra-user mailing list archives

From Benjamin Black...@b3k.us>
Subject Re: Node OOM Problems
Date Sun, 22 Aug 2010 17:14:26 GMT
Is the need for 10k/sec/node just for bulk loading of data or is it
how your app will operate normally?  Those are very different things.

On Sun, Aug 22, 2010 at 4:11 AM, Wayne <wav100@gmail.com> wrote:
> Currently each node has 4x1TB SATA disks. In MySQL we have 15tb currently
> with no replication. To move this to Cassandra replication factor 3 we need
> 45TB assuming the space usage is the same, but it is probably more. We had
> assumed a 30 node cluster with 4tb per node would suffice with head room for
> compaction and growth (120 TB).
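
The sizing in the paragraph above can be checked with a few lines of Python. This is only a sketch of the arithmetic as stated in the message (15 TB of data, replication factor 3, 30 nodes with 4 TB each). The function names are made up for illustration, and it assumes, as the message does, that on-disk size in Cassandra matches MySQL, which may well not hold.

```python
def replicated_tb(data_tb: float, replication_factor: int) -> float:
    """Total storage consumed once every row is stored replication_factor times."""
    return data_tb * replication_factor


def cluster_capacity_tb(nodes: int, tb_per_node: float) -> float:
    """Raw disk capacity of the whole cluster."""
    return nodes * tb_per_node


# Figures taken from the message; same-size-as-MySQL is an assumption.
needed = replicated_tb(15, 3)            # 45 TB after replication
capacity = cluster_capacity_tb(30, 4)    # 120 TB raw
headroom = capacity - needed             # space left for compaction and growth
utilization = needed / capacity          # fraction of raw disk holding live data

print(f"needed={needed} TB, capacity={capacity} TB, "
      f"headroom={headroom} TB, utilization={utilization:.1%}")
```

On these numbers the cluster would start out a little under 40% full, leaving the rest for compaction scratch space and growth.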
>
> SSD drives for 30 nodes in this size range are not cost feasible for us. We
> can try to use 15k SAS drives and have more spindles but then our per node
> cost goes up. I guess I naively thought cassandra would do its magic and a
> few commodity SATA hard drives would be fine.
>
> Our performance requirement does not need 10k writes/node/sec 24 hours a
> day, but if we can not get really good performance the switch from MySQL
> becomes harder to rationalize. We can currently restore from a MySQL dump a
> 2.5 terabyte backup (plain old insert statements) in 4-5 days. I expect as
> much or more from cassandra and I feel years away from simply loading 2+tb
> into cassandra without so many issues.
>
> What is really required in hardware for a 100+tb cluster with near 10k/sec
> write performance sustained? If the answer is SSD what can be expected from
> 15k SAS drives and what from SATA?
>
> Thank you for your advice, I am struggling with how to make this work. Any
> insight you can provide would be greatly appreciated.
>
>
>
> On Sun, Aug 22, 2010 at 8:58 AM, Benjamin Black <b@b3k.us> wrote:
>>
>> How much storage do you need?  240G SSDs quite capable of saturating a
>> 3Gbps SATA link are $600.  Larger ones are also available with similar
>> performance.  Perhaps you could share a bit more about the storage and
>> performance requirements.  How using SSDs to sustain 10k writes/sec PER NODE
>> WITH LINEAR SCALING "breaks down the commodity server concept" eludes
>> me.
>>
>>
>> b
>>
>> On Sat, Aug 21, 2010 at 11:27 PM, Wayne <wav100@gmail.com> wrote:
>> > Thank you for the advice, I will try these settings. I am running defaults
>> > right now. The disk subsystem is one SATA disk for commitlog and 4 SATA
>> > disks in raid 0 for the data.
>> >
>> > From your email you are implying this hardware can not handle this level of
>> > sustained writes? That kind of breaks down the commodity server concept for
>> > me. I have never used anything but a 15k SAS disk (fastest disk money could
>> > buy until SSD) ALWAYS with a database. I have tried to throw out that
>> > mentality here but are you saying nothing has really changed? Spindles
>> > spindles spindles as fast as you can afford is what I have always known...I
>> > guess that applies here? Do I need to spend $10k per node instead of $3.5k
>> > to get SUSTAINED 10k writes/sec per node?
>> >
>> >
>> >
>> > On Sat, Aug 21, 2010 at 11:03 PM, Benjamin Black <b@b3k.us> wrote:
>> >>
>> >> My guess is that you have (at least) 2 problems right now:
>> >>
>> >> You are writing 10k ops/sec to each node, but have default memtable
>> >> flush settings.  This is resulting in memtable flushing every 30
>> >> seconds (default ops flush setting is 300k).  You thus have a
>> >> proliferation of tiny sstables and are seeing minor compactions
>> >> triggered every couple of minutes.
>> >>
>> >> You have started a major compaction which is now competing with those
>> >> near constant minor compactions for far too little I/O (3 SATA drives
>> >> in RAID0, perhaps?).  Normally, this would result in a massive
>> >> ballooning of your heap use as all sorts of activities (like memtable
>> >> flushes) backed up, as well.
>> >>
>> >> I suggest you increase the memtable flush ops to at least 10 (million)
>> >> if you are going to sustain that many writes/sec, along with an
>> >> increase in the flush MB to match, based on your typical bytes/write
>> >> op.  Long term, this level of write activity demands a lot faster
>> >> storage (iops and bandwidth).
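
The flush arithmetic in the message above can be sketched as follows: at 10k ops/sec, a 300k-op threshold is reached about every 30 seconds, while a 10-million-op threshold would be reached roughly every 17 minutes. The function names are hypothetical, and the 100 bytes/write figure is purely an illustrative assumption (the thread does not state a typical write size). The size estimate covers raw payload only, not any in-memory overhead.

```python
def flush_interval_seconds(flush_after_ops: int, writes_per_sec: int) -> float:
    """Seconds until the ops-based memtable flush threshold is reached."""
    return flush_after_ops / writes_per_sec


def payload_mb(flush_after_ops: int, avg_bytes_per_write: int) -> float:
    """Raw payload (in MB) accumulated by the time the ops threshold is hit."""
    return flush_after_ops * avg_bytes_per_write / (1024 * 1024)


WRITES_PER_SEC = 10_000        # per node, from the thread
ASSUMED_BYTES_PER_WRITE = 100  # assumption for illustration only

for ops in (300_000, 10_000_000):
    interval = flush_interval_seconds(ops, WRITES_PER_SEC)
    size = payload_mb(ops, ASSUMED_BYTES_PER_WRITE)
    print(f"flush after {ops:>10,} ops: every {interval:7.1f} s, "
          f"about {size:7.1f} MB of payload")
```

The second number is what the "flush MB to match" advice is about: the size-based threshold has to be at least as large as the payload that accumulates before the ops-based threshold fires, or the size limit will trigger the flush first.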
>> >>
>> >>
>> >> b
>> >> On Sat, Aug 21, 2010 at 2:18 AM, Wayne <wav100@gmail.com> wrote:
>> >> > I am already running with those options. I thought maybe that is why they
>> >> > never get completed as they keep getting pushed down in priority? I am
>> >> > getting timeouts now and then but for the most part the cluster keeps
>> >> > running. Is it normal/ok for the repair and compaction to take so long? It
>> >> > has been over 12 hours since they were submitted.
>> >> >
>> >> > On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>> >> >>
>> >> >> yes, the AES is the repair.
>> >> >>
>> >> >> if you are running linux, try adding the options to reduce compaction
>> >> >> priority from http://wiki.apache.org/cassandra/PerformanceTuning
>> >> >>
>> >> >> On Sat, Aug 21, 2010 at 3:17 AM, Wayne <wav100@gmail.com> wrote:
>> >> >> > I could tell from munin that the disk utilization was getting crazy
>> >> >> > high, but the strange thing is that it seemed to "stall". The
>> >> >> > utilization went way down and everything seemed to flatten out.
>> >> >> > Requests piled up and the node was doing nothing. It did not "crash"
>> >> >> > but was left in a useless state. I do not have access to the tpstats
>> >> >> > when that occurred. Attached is the munin chart, and you can see the
>> >> >> > flat line after Friday at noon.
>> >> >> >
>> >> >> > I have reduced the writers from 10 to 8 per node and they seem to be
>> >> >> > still running, but I am afraid they are barely hanging on. I ran
>> >> >> > nodetool repair after rebooting the failed node and I do not think the
>> >> >> > repair ever completed. I also later ran compact on each node and on
>> >> >> > some it finished but on some it did not. Below is the tpstats
>> >> >> > currently for the node I had to restart. Is the AE-SERVICE-STAGE the
>> >> >> > repair and compaction queued up? It seems several nodes are not
>> >> >> > getting enough free cycles to keep up. They are not timing out (30 sec
>> >> >> > timeout) for the most part but they are also not able to compact. Is
>> >> >> > this normal? Do I just give it time? I am migrating 2-3 TB of data
>> >> >> > from Mysql so the load is constant and will be for days and it seems
>> >> >> > even with only 8 writer processes per node I am maxed out.
>> >> >> >
>> >> >> > Thanks for the advice. Any more pointers would be greatly appreciated.
>> >> >> >
>> >> >> > Pool Name                    Active   Pending      Completed
>> >> >> > FILEUTILS-DELETE-POOL             0         0           1868
>> >> >> > STREAM-STAGE                      1         1              2
>> >> >> > RESPONSE-STAGE                    0         2      769158645
>> >> >> > ROW-READ-STAGE                    0         0         140942
>> >> >> > LB-OPERATIONS                     0         0              0
>> >> >> > MESSAGE-DESERIALIZER-POOL         1         0     1470221842
>> >> >> > GMFD                              0         0         169712
>> >> >> > LB-TARGET                         0         0              0
>> >> >> > CONSISTENCY-MANAGER               0         0              0
>> >> >> > ROW-MUTATION-STAGE                0         1      865124937
>> >> >> > MESSAGE-STREAMING-POOL            0         0              6
>> >> >> > LOAD-BALANCER-STAGE               0         0              0
>> >> >> > FLUSH-SORTER-POOL                 0         0              0
>> >> >> > MEMTABLE-POST-FLUSHER             0         0           8088
>> >> >> > FLUSH-WRITER-POOL                 0         0           8088
>> >> >> > AE-SERVICE-STAGE                  1        34             54
>> >> >> > HINTED-HANDOFF-POOL               0         0              7
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Fri, Aug 20, 2010 at 11:56 PM, Bill de hÓra <bill@dehora.net> wrote:
>> >> >> >>
>> >> >> >> On Fri, 2010-08-20 at 19:17 +0200, Wayne wrote:
>> >> >> >>
>> >> >> >> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602 MessageDeserializationTask.java (line 47) dropping message (1,078,378ms past timeout)
>> >> >> >> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602 MessageDeserializationTask.java (line 47) dropping message (1,078,378ms past timeout)
>> >> >> >>
>> >> >> >> MESSAGE-DESERIALIZER-POOL usually backs up when other stages are bogged
>> >> >> >> down downstream (e.g. here's Ben Black describing the symptom when the
>> >> >> >> underlying cause is running out of disk bandwidth, well worth a watch:
>> >> >> >> http://riptano.blip.tv/file/4012133/).
>> >> >> >>
>> >> >> >> Can you send all of nodetool tpstats?
>> >> >> >>
>> >> >> >> Bill
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Jonathan Ellis
>> >> >> Project Chair, Apache Cassandra
>> >> >> co-founder of Riptano, the source for professional Cassandra support
>> >> >> http://riptano.com
>> >> >
>> >> >
>> >
>> >
>
>
