From: Benjamin Black
Date: Sat, 21 Aug 2010 13:53:14 -0700
Subject: Re: Node OOM Problems
To: user@cassandra.apache.org

Perhaps I missed it in one of the earlier emails, but what is your disk subsystem config?

On Sat, Aug 21, 2010 at 2:18 AM, Wayne wrote:
> I am already running with those options. I thought maybe that is why they
> never get completed, as they keep getting pushed down in priority? I am
> getting timeouts now and then but for the most part the cluster keeps
> running. Is it normal/ok for the repair and compaction to take so long? It
> has been over 12 hours since they were submitted.
>
> On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis wrote:
>>
>> yes, the AES is the repair.
>>
>> if you are running linux, try adding the options to reduce compaction
>> priority from
>> http://wiki.apache.org/cassandra/PerformanceTuning
>>
>> On Sat, Aug 21, 2010 at 3:17 AM, Wayne wrote:
>> > I could tell from munin that the disk utilization was getting crazy
>> > high, but the strange thing is that it seemed to "stall". The
>> > utilization went way down and everything seemed to flatten out.
>> > Requests piled up and the node was doing nothing. It did not "crash"
>> > but was left in a useless state. I do not have access to the tpstats
>> > when that occurred. Attached is the munin chart, and you can see the
>> > flat line after Friday at noon.
>> >
>> > I have reduced the writers from 10 per node to 8 per node and they
>> > seem to be still running, but I am afraid they are barely hanging on.
>> > I ran nodetool repair after rebooting the failed node and I do not
>> > think the repair ever completed. I also later ran compact on each
>> > node; on some it finished but on some it did not. Below is the
>> > tpstats currently for the node I had to restart. Is the
>> > AE-SERVICE-STAGE the repair and compaction queued up? It seems
>> > several nodes are not getting enough free cycles to keep up. They are
>> > not timing out (30 sec timeout) for the most part but they are also
>> > not able to compact. Is this normal? Do I just give it time? I am
>> > migrating 2-3 TB of data from MySQL so the load is constant and will
>> > be for days, and it seems even with only 8 writer processes per node
>> > I am maxed out.
>> >
>> > Thanks for the advice. Any more pointers would be greatly appreciated.
>> >
>> > Pool Name                    Active   Pending      Completed
>> > FILEUTILS-DELETE-POOL             0         0           1868
>> > STREAM-STAGE                      1         1              2
>> > RESPONSE-STAGE                    0         2      769158645
>> > ROW-READ-STAGE                    0         0         140942
>> > LB-OPERATIONS                     0         0              0
>> > MESSAGE-DESERIALIZER-POOL         1         0     1470221842
>> > GMFD                              0         0         169712
>> > LB-TARGET                         0         0              0
>> > CONSISTENCY-MANAGER               0         0              0
>> > ROW-MUTATION-STAGE                0         1      865124937
>> > MESSAGE-STREAMING-POOL            0         0              6
>> > LOAD-BALANCER-STAGE               0         0              0
>> > FLUSH-SORTER-POOL                 0         0              0
>> > MEMTABLE-POST-FLUSHER             0         0           8088
>> > FLUSH-WRITER-POOL                 0         0           8088
>> > AE-SERVICE-STAGE                  1        34             54
>> > HINTED-HANDOFF-POOL               0         0              7
>> >
>> > On Fri, Aug 20, 2010 at 11:56 PM, Bill de hÓra wrote:
>> >>
>> >> On Fri, 2010-08-20 at 19:17 +0200, Wayne wrote:
>> >>
>> >> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
>> >> > MessageDeserializationTask.java (line 47) dropping message
>> >> > (1,078,378ms past timeout)
>> >> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
>> >> > MessageDeserializationTask.java (line 47) dropping message
>> >> > (1,078,378ms past timeout)
>> >>
>> >> MESSAGE-DESERIALIZER-POOL usually backs up when other stages are
>> >> bogged downstream (e.g. here's Ben Black describing the symptom when
>> >> the underlying cause is running out of disk bandwidth, well worth a
>> >> watch: http://riptano.blip.tv/file/4012133/).
>> >>
>> >> Can you send all of nodetool tpstats?
>> >>
>> >> Bill
>> >>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
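
For anyone reading the tpstats dump quoted above, here is a minimal sketch of how one might pick out the stages with queued work. It is not something posted in the thread: the function name `backlogged_pools`, the `min_pending` threshold, and the abbreviated `SAMPLE` text are all assumptions made for illustration, and the parser only assumes the four-column layout (pool name, Active, Pending, Completed) shown in the quoted output.

```python
from typing import List, Tuple

# Abbreviated copy of the tpstats rows quoted in the thread (illustration only).
SAMPLE = """\
Pool Name                    Active   Pending      Completed
STREAM-STAGE                      1         1              2
RESPONSE-STAGE                    0         2      769158645
MESSAGE-DESERIALIZER-POOL         1         0     1470221842
ROW-MUTATION-STAGE                0         1      865124937
AE-SERVICE-STAGE                  1        34             54
"""


def backlogged_pools(tpstats_text: str, min_pending: int = 10) -> List[Tuple[str, int]]:
    """Return (pool, pending) pairs with pending >= min_pending, largest first.

    Hypothetical helper: the threshold of 10 is an arbitrary assumption,
    not a value recommended anywhere in the thread.
    """
    found = []
    for line in tpstats_text.splitlines():
        parts = line.split()
        # Data rows have exactly four tokens with three numeric columns;
        # the header row ("Pool Name Active ...") has five and is skipped.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
            continue
        name, _active, pending, _completed = parts
        if int(pending) >= min_pending:
            found.append((name, int(pending)))
    # sorted() is stable, so pools with equal pending counts keep input order.
    return sorted(found, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for pool, pending in backlogged_pools(SAMPLE):
        print(f"{pool}: {pending} pending")
```

On the sample rows, only AE-SERVICE-STAGE crosses the default threshold, which matches the stage Wayne asks about; lowering `min_pending` to 1 would also list the stages with one or two pending tasks.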