From: Keith Wright <kwright@nanigans.com>
To: user@cassandra.apache.org
Date: Thu, 16 May 2013 09:14:04 -0500
Subject: SSTable size versus read performance

Hi all,

    I currently have 2 clusters, one running on 1.1.10 using CQL2 and one running on 1.2.4 using CQL3 and vnodes. The machines in the 1.2.4 cluster are expected to have better IO performance, as we are going from 1 SSD data disk per node in the 1.1 cluster to 3 SSD data disks per node in the 1.2 cluster, with higher-end drives (commit logs are on their own disk, shared with the OS). I am doing some stress testing on the 1.2 cluster and have found that although the reads/sec as seen from iostat are approximately the same (3K/sec) in both clusters, the MB/s read in the new cluster is MUCH higher (7 MB/s in 1.1 compared to 30-50 MB/s in 1.2).
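(Back of the envelope, assuming iostat counts the same way on both clusters: 30-50 MB/s at ~3,000 reads/sec works out to roughly 10-17 KB read per operation on 1.2, versus 7 MB/s / 3,000 ≈ 2.4 KB per operation on 1.1, so each read appears to be pulling 4-7x more data off disk.)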
As a result, I am seeing excessive iowait in the 1.2 cluster, causing high average read times of 30 ms under the same load (the 1.1 cluster sees around 5 ms). They are both using leveled compaction, but one thing I did change in the new cluster was to increase the SSTable size from the OOTB setting to 32 MB. Note that my reads are by definition highly random, as we are running memcached in front for various reasons. Does Cassandra need to read the entire SSTable when fetching a row, or only the relevant chunk (I have the OOTB chunk size and BF settings)? I just decreased the SSTable size to 5 MB and am waiting for compactions to complete to see if that makes a difference.

Thanks!

Relevant table definition, if helpful (note that I also changed to the LZ4 compressor expecting better read performance, and I decreased the crc_check_chance, again to minimize read latency):

CREATE TABLE global_user (
    user_id BIGINT,
    app_id INT,
    type TEXT,
    name TEXT,
    last TIMESTAMP,
    paid BOOLEAN,
    values map<TIMESTAMP,FLOAT>,
    sku_time map<TEXT,TIMESTAMP>,
    extra_param map<TEXT,TEXT>,
    PRIMARY KEY (user_id, app_id, type, name)
) WITH compression = {'crc_check_chance': 0.1, 'sstable_compression': 'LZ4Compressor'}
  AND compaction = {'class': 'LeveledCompactionStrategy'}
  AND compaction_strategy_options = {'sstable_size_in_mb': 5}
  AND gc_grace_seconds = 86400;
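For reference, the statement I ran to drop the SSTable size was along these lines (same option syntax as the CREATE above; I'm on 1.2.4 CQL3, so exact option names may differ in other versions):

ALTER TABLE global_user
  WITH compaction = {'class': 'LeveledCompactionStrategy'}
  AND compaction_strategy_options = {'sstable_size_in_mb': 5};

As far as I understand, only newly written SSTables pick up the new target size, and existing ones shrink as they get rewritten by compaction, hence my waiting on compactions before re-measuring.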
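If the answer is that only the relevant compressed chunk is read per row, the next thing I may try is shrinking the chunk size from the 64 KB default, since with fully random single-row reads a large chunk means reading and decompressing far more data than the row itself. Something like this (untested sketch; chunk_length_kb is the 1.2 option name as I understand it):

ALTER TABLE global_user
  WITH compression = {'sstable_compression': 'LZ4Compressor',
                      'chunk_length_kb': 8,
                      'crc_check_chance': 0.1};

I gather existing SSTables keep their old chunk size until rewritten (e.g. via nodetool upgradesstables), so this would also only take effect as tables are rewritten.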