cassandra-user mailing list archives

From Wayne <wav...@gmail.com>
Subject Re: Node OOM Problems
Date Sun, 22 Aug 2010 07:00:55 GMT
Since compaction is so expensive in terms of disk resources, does it make
more sense to have 2 data volumes instead of one? We have 4 data disks in
RAID 0; would it make more sense to run 2 x 2 disks in RAID 0? That way, I
assume, the reader and the writer would always be on different sets of
spindles?

On Sun, Aug 22, 2010 at 8:27 AM, Wayne <wav100@gmail.com> wrote:

> Thank you for the advice, I will try these settings. I am running defaults
> right now. The disk subsystem is one SATA disk for commitlog and 4 SATA
> disks in raid 0 for the data.
>
> From your email, you are implying this hardware cannot handle this level of
> sustained writes? That kind of breaks down the commodity server concept for
> me. With a database I have ALWAYS used 15k SAS disks (the fastest disks
> money could buy until SSD). I have tried to throw out that mentality here,
> but are you saying nothing has really changed? Spindles, spindles, spindles,
> as fast as you can afford, is what I have always known. I guess that applies
> here? Do I need to spend $10k per node instead of $3.5k to get SUSTAINED
> 10k writes/sec per node?
>
>
>
>
> On Sat, Aug 21, 2010 at 11:03 PM, Benjamin Black <b@b3k.us> wrote:
>
>> My guess is that you have (at least) 2 problems right now:
>>
>> You are writing 10k ops/sec to each node, but have default memtable
>> flush settings.  This is resulting in memtable flushing every 30
>> seconds (default ops flush setting is 300k).  You thus have a
>> proliferation of tiny sstables and are seeing minor compactions
>> triggered every couple of minutes.
>>
>> You have started a major compaction which is now competing with those
>> near constant minor compactions for far too little I/O (3 SATA drives
>> in RAID0, perhaps?).  Normally, this would result in a massive
>> ballooning of your heap use as all sorts of activities (like memtable
>> flushes) backed up, as well.
>>
>> I suggest you increase the memtable flush ops to at least 10 (million)
>> if you are going to sustain that many writes/sec, along with an
>> increase in the flush MB to match, based on your typical bytes/write
>> op.  Long term, this level of write activity demands a lot faster
>> storage (iops and bandwidth).
>>
>>
>> b
>> On Sat, Aug 21, 2010 at 2:18 AM, Wayne <wav100@gmail.com> wrote:
>> > I am already running with those options. I thought maybe that is why they
>> > never complete, as they keep getting pushed down in priority? I am
>> > getting timeouts now and then, but for the most part the cluster keeps
>> > running. Is it normal/ok for the repair and compaction to take so long?
>> > It has been over 12 hours since they were submitted.
>> >
>> > On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>> >>
>> >> yes, the AES is the repair.
>> >>
>> >> if you are running linux, try adding the options to reduce compaction
>> >> priority from
>> >> http://wiki.apache.org/cassandra/PerformanceTuning
>> >>
>> >> On Sat, Aug 21, 2010 at 3:17 AM, Wayne <wav100@gmail.com> wrote:
>> >> > I could tell from munin that the disk utilization was getting crazy
>> >> > high, but the strange thing is that it seemed to "stall". The
>> >> > utilization went way down and everything seemed to flatten out.
>> >> > Requests piled up and the node was doing nothing. It did not "crash"
>> >> > but was left in a useless state. I do not have access to the tpstats
>> >> > from when that occurred. Attached is the munin chart, and you can see
>> >> > the flat line after Friday at noon.
>> >> >
>> >> > I have reduced the writers from 10 per node to 8 per node and they
>> >> > still seem to be running, but I am afraid they are barely hanging on.
>> >> > I ran nodetool repair after rebooting the failed node and I do not
>> >> > think the repair ever completed. I also later ran compact on each
>> >> > node; on some it finished but on others it did not. Below is the
>> >> > current tpstats for the node I had to restart. Is the
>> >> > AE-SERVICE-STAGE the repair and compaction queued up? It seems
>> >> > several nodes are not getting enough free cycles to keep up. They are
>> >> > not timing out (30 sec timeout) for the most part, but they are also
>> >> > not able to compact. Is this normal? Do I just give it time? I am
>> >> > migrating 2-3 TB of data from MySQL, so the load is constant and will
>> >> > be for days, and it seems that even with only 8 writer processes per
>> >> > node I am maxed out.
>> >> >
>> >> > Thanks for the advice. Any more pointers would be greatly appreciated.
>> >> >
>> >> > Pool Name                    Active   Pending      Completed
>> >> > FILEUTILS-DELETE-POOL             0         0           1868
>> >> > STREAM-STAGE                      1         1              2
>> >> > RESPONSE-STAGE                    0         2      769158645
>> >> > ROW-READ-STAGE                    0         0         140942
>> >> > LB-OPERATIONS                     0         0              0
>> >> > MESSAGE-DESERIALIZER-POOL         1         0     1470221842
>> >> > GMFD                              0         0         169712
>> >> > LB-TARGET                         0         0              0
>> >> > CONSISTENCY-MANAGER               0         0              0
>> >> > ROW-MUTATION-STAGE                0         1      865124937
>> >> > MESSAGE-STREAMING-POOL            0         0              6
>> >> > LOAD-BALANCER-STAGE               0         0              0
>> >> > FLUSH-SORTER-POOL                 0         0              0
>> >> > MEMTABLE-POST-FLUSHER             0         0           8088
>> >> > FLUSH-WRITER-POOL                 0         0           8088
>> >> > AE-SERVICE-STAGE                  1        34             54
>> >> > HINTED-HANDOFF-POOL               0         0              7
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Aug 20, 2010 at 11:56 PM, Bill de hÓra <bill@dehora.net> wrote:
>> >> >>
>> >> >> On Fri, 2010-08-20 at 19:17 +0200, Wayne wrote:
>> >> >>
>> >> >> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
>> >> >> > MessageDeserializationTask.java (line 47) dropping message
>> >> >> > (1,078,378ms past timeout)
>> >> >> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
>> >> >> > MessageDeserializationTask.java (line 47) dropping message
>> >> >> > (1,078,378ms past timeout)
>> >> >>
>> >> >> MESSAGE-DESERIALIZER-POOL usually backs up when other stages are
>> >> >> bogged down downstream (e.g. here's Ben Black describing the symptom
>> >> >> when the underlying cause is running out of disk bandwidth, well
>> >> >> worth a watch: http://riptano.blip.tv/file/4012133/).
>> >> >>
>> >> >> Can you send all of nodetool tpstats?
>> >> >>
>> >> >> Bill
>> >> >>
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Jonathan Ellis
>> >> Project Chair, Apache Cassandra
>> >> co-founder of Riptano, the source for professional Cassandra support
>> >> http://riptano.com
>> >
>> >
>>
>
>
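
The flush arithmetic in Benjamin's reply (10k ops/sec against a 300k-op
flush threshold gives a flush roughly every 30 seconds) can be checked with
a small sketch. This is only a back-of-the-envelope calculation, not
anything from Cassandra itself: the function names are made up for
illustration, and the 200 bytes per write is a placeholder to be replaced
with the real average for your data.

```python
def flush_interval_seconds(flush_ops_threshold, writes_per_sec):
    """Seconds between memtable flushes when the ops threshold is the trigger."""
    return flush_ops_threshold / writes_per_sec


def matching_flush_mb(flush_ops_threshold, avg_bytes_per_write):
    """Flush size in MB that is reached at about the same time as the ops threshold."""
    return flush_ops_threshold * avg_bytes_per_write / (1024 * 1024)


WRITES_PER_SEC = 10_000          # sustained write rate per node, from the thread
DEFAULT_FLUSH_OPS = 300_000      # default ops flush setting quoted in the thread
SUGGESTED_FLUSH_OPS = 10_000_000 # "at least 10 (million)" from the thread
AVG_BYTES_PER_WRITE = 200        # ASSUMPTION: placeholder, measure your own data

default_interval = flush_interval_seconds(DEFAULT_FLUSH_OPS, WRITES_PER_SEC)
suggested_interval = flush_interval_seconds(SUGGESTED_FLUSH_OPS, WRITES_PER_SEC)
suggested_mb = matching_flush_mb(SUGGESTED_FLUSH_OPS, AVG_BYTES_PER_WRITE)

print(f"default threshold:   flush every {default_interval:.0f} s")
print(f"suggested threshold: flush every {suggested_interval:.0f} s")
print(f"flush MB to match (at {AVG_BYTES_PER_WRITE} B/write): {suggested_mb:.0f}")
```

At the suggested threshold the same write rate flushes roughly every 1000
seconds instead of every 30, which means far fewer small sstables for minor
compaction to work through. Whether a memtable of the matching size fits in
the heap depends on the node and is not something this sketch checks.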
