From: Иван Соболев
To: user@cassandra.apache.org
Subject: Cassandra nodes failing with OOM
Date: Sat, 17 Nov 2012 08:07:58 +0200

Dear Community,

We need your advice.

We have a cluster in which one sixth of the nodes died for various reasons (3 of them logged an OOM message).
The nodes died in groups of 3, 1, and 2. No adjacent nodes died, though we use SimpleSnitch.

Version:      1.1.6
Hardware:     12 GB RAM / 8 cores (virtual)
Data:         40 GB/node
Nodes:        36

Keyspaces:    2 (RF=3, R=W=2) + 1 (OpsCenter)
CFs:          36, 2 indexes
Partitioner:  Random
Compaction:   Leveled (we don't want 2x space for housekeeping)
Caching:      Keys only

Everything is pretty much standard, apart from one CF that receives writes in 64K chunks and has sstable_size_in_mb=100.
JNA is not installed; this is to be fixed soon.

Checking sysstat/sar, I can see 80-90% CPU idle and no anomalies in I/O; the only change is a spike in network activity.
Before dying, all the nodes had the following in their logs:
> INFO [ScheduledTasks:1] 2012-11-15 21:35:05,512 StatusLogger.java (line 72) MemtablePostFlusher    1    4    0
> INFO [ScheduledTasks:1] 2012-11-15 21:35:13,540 StatusLogger.java (line 72) FlushWriter            1    3    0
> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 72) HintedHandoff          1    6    0
> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 77) CompactionManager      5    9

GCInspector warnings were there too; the reported heap went from ~0.8 to 3 GB in 5-10 minutes.

So, could you please give me a hint on:
1. How many GCInspector warnings per hour are considered 'normal'?
2. What should be the next thing to check?
3. What are the possible failure reasons, and how can we prevent them?

Thank you very much in advance,
Ivan
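For anyone who wants to compare nodes, the pending counts in StatusLogger lines like the ones quoted in this message can be pulled out with a short script. This is only a sketch in Python, not part of Cassandra: it assumes the numbers after the pool name are active, pending, and blocked (CompactionManager shows only the first two), and the names `parse_status_line` and `pools_with_backlog` are made up for illustration.

```python
import re

# Assumed layout of a StatusLogger line (an assumption, not a documented format):
#   "... StatusLogger.java (line N) <pool> <active> <pending> [<blocked>]"
LINE_RE = re.compile(
    r"StatusLogger\.java \(line \d+\)\s+"
    r"(?P<pool>[A-Za-z]\w*)\s+(?P<counts>\d+(?:\s+\d+)*)\s*$"
)

# The four lines quoted in the message, used here as sample input.
SAMPLE = [
    "INFO [ScheduledTasks:1] 2012-11-15 21:35:05,512 StatusLogger.java (line 72) MemtablePostFlusher 1 4 0",
    "INFO [ScheduledTasks:1] 2012-11-15 21:35:13,540 StatusLogger.java (line 72) FlushWriter 1 3 0",
    "INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 72) HintedHandoff 1 6 0",
    "INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 77) CompactionManager 5 9",
]


def parse_status_line(line):
    """Return a dict with pool, active, pending, blocked, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    counts = [int(x) for x in m.group("counts").split()]
    if len(counts) < 2:
        return None
    # Some pools (e.g. CompactionManager) log no third column.
    blocked = counts[2] if len(counts) > 2 else None
    return {
        "pool": m.group("pool"),
        "active": counts[0],
        "pending": counts[1],
        "blocked": blocked,
    }


def pools_with_backlog(lines, threshold=1):
    """List (pool, pending) pairs with pending >= threshold, largest first."""
    found = []
    for line in lines:
        parsed = parse_status_line(line)
        if parsed is not None and parsed["pending"] >= threshold:
            found.append((parsed["pool"], parsed["pending"]))
    return sorted(found, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for pool, pending in pools_with_backlog(SAMPLE, threshold=4):
        print(pool, pending)
```

Run against the four sample lines with `threshold=4`, this lists CompactionManager (9), HintedHandoff (6), and MemtablePostFlusher (4), and leaves out FlushWriter (3). The threshold is arbitrary; it only helps to rank pools across nodes, not to decide what counts as unhealthy.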