From: "Dan Hendry" <dan@ec2.dustbunnytycoon.com>
To: user@cassandra.apache.org
Subject: Out of Memory Issues - SERIOUS
Date: Thu, 7 Oct 2010 23:32:14 -0400

There seems to have been a fair amount of discussion on memory-related issues, so I apologize if this exact situation has come up before.

I am currently load testing a metrics platform I have written on top of Cassandra, and I have run into some very troubling issues. The application writes quite heavily: about 1000-2000 updates (columns) per second, using batch mutates of 20 columns each. This is divided between creating new rows and adding columns to a fairly limited number (<30) of existing index rows. Nearly all of these updates are read back within 10 seconds, and none contain any significant amount of data (generally much less than 100 bytes that I specify). Initially the test hums along nicely, but after some amount of time (1-2 hours) Cassandra crashes with an out-of-memory error. Unfortunately I have not had the opportunity to watch the test as it crashes, but it has happened in 2/2 tests.
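To make the write pattern concrete, here is a rough sketch of what each batch amounts to. It is written against the raw Thrift client that Pelops wraps, not my actual Pelops code, and the keyspace, column family, and key names are made-up placeholders:

    import java.nio.ByteBuffer;
    import java.util.*;
    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class WriteSketch {
        public static void main(String[] args) throws Exception {
            TTransport tr = new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(tr));
            tr.open();
            client.set_keyspace("Metrics"); // placeholder keyspace

            // One batch mutate: 20 small columns, each well under 100 bytes
            List<Mutation> mutations = new ArrayList<Mutation>();
            long ts = System.currentTimeMillis() * 1000; // microsecond timestamps
            for (int i = 0; i < 20; i++) {
                Column c = new Column(ByteBuffer.wrap(("metric" + i).getBytes("UTF-8")),
                                      ByteBuffer.wrap("v".getBytes("UTF-8")), ts);
                ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
                cosc.setColumn(c);
                Mutation m = new Mutation();
                m.setColumn_or_supercolumn(cosc);
                mutations.add(m);
            }
            Map<ByteBuffer, Map<String, List<Mutation>>> batch =
                    new HashMap<ByteBuffer, Map<String, List<Mutation>>>();
            batch.put(ByteBuffer.wrap("some-row-key".getBytes("UTF-8")),
                      Collections.singletonMap("Events", mutations)); // placeholder CF
            client.batch_mutate(batch, ConsistencyLevel.ONE); // writes go out at ONE
            tr.close(); // the app then reads these columns back at ConsistencyLevel.ALL
        }
    }

The real application issues 50-100 of these batches per second, split between brand new rows and the small set of index rows.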
This is quite annoying, but the absolutely TERRIFYING behaviour is that when I restart Cassandra, it starts replaying the commit logs and then crashes with an out-of-memory error again. Restart a second time, crash with OOM; it seems to get through about 3/4 of the commit logs. Just to be absolutely explicit: at this point I am not trying to insert or read anything, just recover the previous updates. Unless somebody can suggest a way to recover the commit logs, I have effectively lost my data. The only way I have found to recover is to wipe the data directories. That does not matter right now, given that this is only a test, but this behaviour is completely unacceptable for a production system.

Here is information about the system which is probably relevant. Let me know if any additional details about my application would help sort out this issue:

- Cassandra 0.7 beta2
- DB machine: EC2 m1.large, with the commit log directory on an EBS volume and the data directory on ephemeral storage
- OS: Ubuntu Server 10.04
- With the exception of changing JMX settings, no memory or JVM options were changed in cassandra-env.sh
- In cassandra.yaml, I reduced binary_memtable_throughput_in_mb to 100 in my second test to try to follow the heap memory calculation formula (worked through below); I have 8 column families
- I am using the Sun JVM, specifically build 1.6.0_20-b02
- The app is written in Java using the latest Pelops library; I am sending updates at consistency level ONE and reading them at level ALL
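For reference, here is the heap arithmetic I was trying to follow. This is the sizing rule of thumb from the 0.7 configuration comments as I understand it, so treat it as approximate:

    heap needed ~= memtable_throughput_in_mb * 3 * (number of hot column families)
                   + 1 GB + internal caches
                ~= 100 MB * 3 * 8 + 1024 MB
                ~= 2400 MB + 1024 MB ~= 3.4 GB plus caches

An m1.large has 7.5 GB of RAM, so a heap in that range ought to fit comfortably on the box.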

I have been fairly impressed with Cassandra overall, and given that I am using a beta version, I don't expect fully polished behaviour. What is unacceptable, and quite frankly nearly unbelievable, is that Cassandra can't seem to recover from the error and I am losing data.

Dan Hendry