From: aaron morton <aaron@thelastpickle.com>
Subject: Re: Suggestions for tuning and monitoring.
Date: Tue, 11 Sep 2012 13:57:58 +1200
To: user@cassandra.apache.org

> It's impossible to start new connections, or impossible to send requests, or it just doesn't return anything when you've sent a request.
If it's totally frozen it sounds like GC. How long does it freeze for?

> Despite that, we occasionally get OOM exceptions, and nodes crashing, maybe a few times per month.
Do you have an error stack?

> We can't find anything in the cassandra logs indicating that something's up
Is it logging dropped messages or high TP pending? Are the freezes associated with compaction or repair running? (There's a quick log-scan sketch at the end of this mail.)

> and occasionally we do bulk deletion of supercolumns in a row.
mmm, are you sending a batch mutation with lots-o-deletes? Each row mutation (insert or delete) in the batch becomes a thread pool task. If you send 1,000 rows in a batch you will temporarily prevent other requests from being served.
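If it helps, here is a rough sketch of keeping those deletions in small batches. It assumes a pycassa-style client; the keyspace, node, and column family names are placeholders, and the exact remove() signature for supercolumns may differ in your client, so treat it as an illustration rather than drop-in code:

    # Rough sketch: chunk a big supercolumn deletion into small batch
    # mutations instead of one huge batch_mutate. pycassa-style client
    # assumed; names and the remove() call are illustrative only.
    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', server_list=['node1:9160'])
    cf = pycassa.ColumnFamily(pool, 'B')          # the super column family

    def delete_supercolumns(row_key, supercolumn_names):
        # queue_size=50 makes the mutator flush every 50 queued mutations,
        # so no single batch ties up the mutation stage for long
        batch = cf.batch(queue_size=50)
        for sc in supercolumn_names:
            batch.remove(row_key, super_column=sc)
        batch.send()                              # flush any remainder

The exact number doesn't matter much; batches in the tens rather than the thousands leave the mutation stage free to serve other requests in between.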
> The config options we are unsure about are things like commit log sizes, ….
I would try to find some indication of what's going on before tweaking. Have you checked iostat?

Hope that helps.
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 11/09/2012, at 2:05 AM, Henrik Schröder <skrolle@gmail.com> wrote:

> Hi all,
>
> We're running a small Cassandra cluster (v1.0.10) serving data to our web application, and as our traffic grows, we're starting to see some weird issues. The biggest of these is that sometimes, a single node becomes unresponsive. It's impossible to start new connections, or impossible to send requests, or it just doesn't return anything when you've sent a request. Our client library is set to retry on another server when this happens, and what we see then is that the request is usually served instantly. So it's not the case that some requests are very difficult, it's that sometimes a node is just "busy", and we have no idea why or what it's doing.
>
> We're using MRTG and Monit to monitor the servers, and in MRTG the average CPU usage is around 5% on our quad-core Xeon servers with SSDs. But occasionally through Monit we can see that the 1-min load average goes above 4, and this usually corresponds to the above issues. Is this common? Does this happen to everyone else? And why the spikiness in load? We can't find anything in the cassandra logs indicating that something's up (such as a slow GC or compaction), and there's no corresponding traffic spike in the application either. Should we just add more nodes if any single one gets CPU spikes?
>
> Another explanation could also be that we've configured it wrong. We're running pretty much the default config. Each node has 16GB of RAM, 4GB of heap, no row cache and a sizeable key cache. Despite that, we occasionally get OOM exceptions, and nodes crashing, maybe a few times per month. Should we increase the heap size? Or move to 1.1 and enable off-heap caching?
>
> We have quite a lot of traffic to the cluster. A single keyspace with two column families, RF=3, compression is enabled, and we're running the default size-tiered compaction.
> Column family A has 60GB of actual data, each row has a single column, and that column holds binary data that varies in size up to 500kB. When we update a row, we write a new value to this single column, effectively replacing that entire row. We do ~1000 updates/s, totalling ~10MB/s in writes.
> Column family B also has 60GB of actual data, but each row has a variable (~100-10000) number of supercolumns, and each supercolumn has a fixed number of columns with a fixed amount of data, totalling ~1kB. The operations we are doing on this column family are adding supercolumns to rows at a rate of ~200/s, and occasionally doing bulk deletion of supercolumns in a row.
>
> The config options we are unsure about are things like commit log sizes, memtable flushing thresholds, commit log syncing, compaction throughput, etc. Are we at a point with our data size and write load that the defaults aren't good enough anymore? Should we stick with size-tiered compaction, even though our application is update-heavy?
>
>
> Many thanks,
> /Henrik
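P.S. For the "is it logging dropped messages or a slow GC" question above, a quick-and-dirty scan of system.log can help. This is only a sketch: the log path and the message patterns are assumptions based on 1.0-era defaults, so adjust them to whatever your install actually writes.

    # Sketch: flag long GC pauses and dropped-message reports in system.log.
    # The path and the regexes are assumptions -- check your own log lines.
    import re

    LOG_PATH = "/var/log/cassandra/system.log"        # assumed default location
    GC_PATTERN = re.compile(r"GCInspector.*GC for (\w+): (\d+) ms")
    DROP_PATTERN = re.compile(r"dropped", re.IGNORECASE)

    with open(LOG_PATH) as log:
        for line in log:
            gc = GC_PATTERN.search(line)
            if gc and int(gc.group(2)) > 1000:        # pauses longer than a second
                print("long GC pause:", line.rstrip())
            elif DROP_PATTERN.search(line):           # dropped message reports
                print("dropped messages:", line.rstrip())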