Subject: Re: SSD vs. HDD
From: Nick Telford
To: user@cassandra.apache.org
Date: Thu, 4 Nov 2010 10:14:42 +0000

If you're bottle-necking on read I/O, making proper use of Cassandra's key
cache and row cache will improve things dramatically.

A little maths using the numbers you've provided tells me that you have about
80GB of "hot" data (data valid in a 4-hour period). That's obviously too much
to cache directly, but you can probably cache some or all of the row keys,
depending on how your columns are distributed among keys. This will prevent
reads from having to hit the indexes for the relevant sstables - eliminating
a seek per sstable.

If you have a subset of this data that is read more than the rest, the row
cache will help you out a lot too. Have a look at your access patterns and
see if it's worthwhile caching some rows.

If you make progress using the various caches but don't have enough memory,
I'd compare the cost of expanding the available memory against switching to
SSDs; I imagine more memory would be cheaper and would last longer.

Finally, given your particular deletion pattern, it's probably worth looking
at 0.7 and upgrading once it is released as stable. CASSANDRA-699[1] adds
support for TTL columns that automatically expire and get removed (during
compaction) without the need for a manual deletion mechanism. Failing this,
since data older than 4 hours is no longer relevant, you should reduce your
GCGraceSeconds to something closer to (but no less than) 4 hours. This will
ensure deleted data is removed faster, keeping your sstables smaller and
allowing the fs cache to operate more effectively.
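For illustration, here is a rough sketch (not from the original thread) of
what writing such a self-expiring column looks like against the 0.7 Thrift
API. The host, keyspace, column family, row key and the 4-hour TTL are
made-up stand-ins for your setup, and the exact generated signatures can
vary between 0.7 builds and client libraries, so treat it as a sketch only:

    import java.nio.ByteBuffer;
    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class TtlColumnExample {
        public static void main(String[] args) throws Exception {
            // Cassandra 0.7 speaks framed Thrift, by default on port 9160.
            TFramedTransport transport =
                    new TFramedTransport(new TSocket("localhost", 9160));
            transport.open();
            Cassandra.Client client =
                    new Cassandra.Client(new TBinaryProtocol(transport));
            client.set_keyspace("Metrics");               // hypothetical keyspace

            // A time-series sample that should simply vanish once it is 4 hours old.
            Column column = new Column();
            column.setName(ByteBuffer.wrap("reading".getBytes("UTF-8")));
            column.setValue(ByteBuffer.wrap("42".getBytes("UTF-8")));
            column.setTimestamp(System.currentTimeMillis() * 1000);  // microseconds
            column.setTtl(4 * 60 * 60);   // seconds; expired columns are purged at the
                                          // next compaction, no explicit delete needed

            client.insert(ByteBuffer.wrap("sensor-1|2010-11-04".getBytes("UTF-8")),
                          new ColumnParent("CF2"),         // hypothetical column family
                          column, ConsistencyLevel.QUORUM);
            transport.close();
        }
    }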
1: https://issues.apache.org/jira/browse/CASSANDRA-699

On 4 November 2010 08:18, Peter Schuller wrote:
> > I am having time out errors while reading.
> > I have 5 CFs but two CFs with high write/read.
> > The data is organized in time series rows; in CF1 the new rows are read
> > every 10 seconds and then the whole rows are deleted, while in CF2 the
> > rows are read in different time-range slices and eventually deleted,
> > maybe after a few hours.
>
> So the first thing to do is to confirm what the bottleneck is. If
> you're having timeouts on reads, and assuming you're not doing reads of
> hot-in-cache data so fast that CPU is the bottleneck (and given that
> you ask about SSD), the hypothesis then is that you're disk-bound due
> to seeking.
>
> Observe the node(s) and in particular use "iostat -x -k 1" (or an
> equivalent graph) and look at the %util and avgqu-sz columns to
> confirm that you are indeed disk-bound. Unless you're doing large
> reads, you will likely see, on average, small reads in amounts that
> simply saturate the underlying storage: %util at 100%, and avgqu-sz
> probably approaching the level of concurrency of your read traffic.
>
> Now, assuming that is true, the question is why. So:
>
> (1) Are you continually saturating disk, or just periodically?
> (2) If periodically, do the periods of saturation correlate with
> compaction being done by Cassandra (or for that matter something
> else)?
> (3) What is your data set size relative to system memory? What is your
> system memory and JVM heap size? (Relevant because it is important to
> look at how much memory the kernel will use for page caching.)
>
> As others have mentioned, the amount of reads done on disk for each
> read from the database (assuming data is not in cache) can be affected
> by how data is written (e.g., partial row writes etc). That is one
> thing that can be addressed, as is restructuring data to allow
> reading more sequentially (if possible). That only helps along one
> dimension though - lessening, somewhat, the cost of cold reads. The
> gains may be limited, and the real problem may be that you simply need
> more memory for caching and/or more IOPS from your storage (i.e., more
> disks, maybe SSD, etc).
>
> If on the other hand you're normally completely fine and you're just
> seeing periods of saturation associated with compaction, this may be
> mitigated in software, for example by rate limiting reads and/or
> writes during compaction and avoiding buffer cache thrashing.
> There's a JIRA ticket for direct I/O
> (https://issues.apache.org/jira/browse/CASSANDRA-1470). I don't think
> there's a JIRA ticket for rate limiting, but I suspect, since you're
> doing time series data, that you're not storing very large values -
> and I would expect compaction to be CPU-bound rather than coming close
> to saturating the disk.
>
> In either case, please do report back as it's interesting to figure
> out what kind of performance issues people are seeing.
>
> --
> / Peter Schuller
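To put point (3) above into rough numbers (the RAM and heap figures here are
purely illustrative, only the ~80GB hot-data estimate comes from the thread):

    16GB RAM - 8GB JVM heap - ~1GB other processes  =  ~7GB for the kernel page cache
    ~7GB page cache vs. ~80GB of hot data           =  well under a tenth of the hot set

So even a perfectly warm page cache leaves most cold reads turning into
seeks; that gap is what gets weighed against the price of extra RAM or SSDs.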