Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of some.unique.login@gmail.com
 designates 74.125.82.44 as permitted sender)
MIME-Version: 1.0
Sender: some.unique.login@gmail.com
Date: Sat, 17 Nov 2012 08:07:58 +0200
Message-ID: 
 <CABWC1-EtAMzfRDb0t-r06g515NAExhWEMGpJW8L_GeG4ZYhuDA@mail.gmail.com>
Subject: Cassandra nodes failing with OOM
From: =?KOI8-R?B?6ddhziBDz8JvzGXX?= <soboleiv@gmail.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=f46d04428c22b6980804ceaab497

--f46d04428c22b6980804ceaab497
Content-Type: text/plain; charset=ISO-8859-1

Dear Community,

I need your advice.

We have a cluster in which 1/6 of the nodes died for various reasons (3 of
them logged an OOM message).
The nodes died in groups of 3, 1 and 2. No adjacent nodes died, though we use
SimpleSnitch.

Version:      1.1.6
Hardware:     12 GB RAM / 8 cores (virtual)
Data:         40 GB/node
Nodes:        36

Keyspaces:    2 (RF=3, R=W=2) + 1 (OpsCenter)
CFs:          36, 2 indexes
Partitioner:  Random
Compaction:   Leveled (we don't want 2x space for housekeeping)
Caching:      Keys only
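
To spell out what RF=3 with R=W=2 means for availability, here is a minimal
sketch of the arithmetic. The variable names are mine, and the comment about
replica placement assumes SimpleStrategy (replicas on consecutive ring nodes),
which is not stated above.

```python
# Values taken from the list above; variable names are illustrative only.
replication_factor = 3   # RF
read_replicas = 2        # R
write_replicas = 2       # W
nodes = 36
data_per_node_gb = 40

# R + W > RF means every read set shares at least one replica with every write set.
overlap = read_replicas + write_replicas - replication_factor
reads_see_latest_write = overlap > 0

# A request still succeeds while at most RF - max(R, W) replicas of a row are down.
# Assumption: with SimpleStrategy those replicas sit on consecutive ring nodes,
# which is why it matters whether adjacent nodes died together.
tolerated_down_replicas = replication_factor - max(read_replicas, write_replicas)

# Rough sizing only: ignores the OpsCenter keyspace and compaction overhead.
total_on_disk_gb = nodes * data_per_node_gb
unique_data_gb = total_on_disk_gb / replication_factor

print(reads_see_latest_write, tolerated_down_replicas, total_on_disk_gb, unique_data_gb)
```

So with these settings a single dead replica per row is tolerated, but two
dead replicas of the same row would fail both reads and writes for it.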

Everything is pretty much standard, apart from one CF that receives writes in
64K chunks and has sstable_size_in_mb=100.
JNA is not installed; this is to be fixed soon.

Checking sysstat/sar, I can see 80-90% CPU idle and no anomalies in I/O; the
only change is a spike in network activity.
Before dying, all the nodes had the following in their logs:
> INFO [ScheduledTasks:1] 2012-11-15 21:35:05,512 StatusLogger.java (line 72) MemtablePostFlusher               1         4         0
> INFO [ScheduledTasks:1] 2012-11-15 21:35:13,540 StatusLogger.java (line 72) FlushWriter                       1         3         0
> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 72) HintedHandoff                     1         6         0
> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 77) CompactionManager                 5         9
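
To compare these counters across nodes, a small parser could pull them out of
the logs. This is only a sketch: the regular expression is inferred from the
four lines quoted above, and the column names (active, pending, blocked) are
my assumption about what the three numbers mean.

```python
import re

# Assumed layout, inferred only from the lines quoted above:
# LEVEL [thread] date time StatusLogger.java (line N) PoolName n1 n2 [n3]
STATUS_RE = re.compile(
    r"^\s*(?:>\s*)?(?P<level>\w+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"StatusLogger\.java\s+\(line\s+\d+\)\s+"
    r"(?P<pool>\S+)\s+(?P<active>\d+)\s+(?P<pending>\d+)"
    r"(?:\s+(?P<blocked>\d+))?\s*$"
)


def parse_status_line(line):
    """Return a dict for one StatusLogger pool line, or None if it does not match."""
    # Collapse all whitespace first, so a record that a mail client wrapped
    # across two lines still matches once it is passed in as one string.
    match = STATUS_RE.match(" ".join(line.split()))
    if match is None:
        return None
    blocked = match.group("blocked")
    return {
        "timestamp": match.group("ts"),
        "pool": match.group("pool"),
        "active": int(match.group("active")),
        "pending": int(match.group("pending")),
        # The CompactionManager line above carries only two numbers.
        "blocked": int(blocked) if blocked is not None else None,
    }


sample = (
    "> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line\n"
    "77) CompactionManager                 5         9"
)
print(parse_status_line(sample))
```

Lines that do not look like a StatusLogger pool record return None, so the
function can be mapped over a whole log file and the results filtered.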

GCInspector warnings were there too; heap usage went from ~0.8 GB to 3 GB in
5-10 minutes.
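
To put numbers on this, a rough sketch for counting GCInspector lines per hour
and for the heap growth rate. It assumes that GCInspector lines mention
"GCInspector" by name and carry the same timestamp layout as the StatusLogger
lines quoted above; both function names are made up for illustration.

```python
import re
from collections import Counter

# Assumption: same "YYYY-MM-DD HH:MM:SS,mmm" timestamp as in the StatusLogger
# lines quoted above. The capture group keeps only date and hour.
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d{3}")


def gc_lines_per_hour(lines):
    """Count log lines mentioning GCInspector, keyed by 'YYYY-MM-DD HH'."""
    counts = Counter()
    for line in lines:
        if "GCInspector" not in line:
            continue
        match = TS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def heap_growth_gb_per_min(start_gb, end_gb, minutes):
    """Average heap growth rate over the given interval."""
    return (end_gb - start_gb) / minutes


# Heap went from ~0.8 to 3 GB in 5-10 minutes: compute both ends of that range.
slowest = heap_growth_gb_per_min(0.8, 3.0, 10)
fastest = heap_growth_gb_per_min(0.8, 3.0, 5)
print(round(slowest, 2), round(fastest, 2))
```

gc_lines_per_hour accepts any iterable of strings, so an open log file can be
passed to it directly.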

So, could you please give me a hint on:
1. How many GCInspector warnings per hour are considered 'normal'?
2. What should be the next thing to check?
3. What are the possible causes of these failures, and how can we prevent them?

Thank you very much in advance,
Ivan

--f46d04428c22b6980804ceaab497--