From: aaron morton <aaron@thelastpickle.com>
To: user@cassandra.apache.org
Subject: Re: Cassandra for numerical data set
Date: Wed, 17 Aug 2011 11:52:53 +1200

> Is that because Cassandra really costs a huge amount of disk space?
The general design approach is / has been that storage space is cheap and plentiful.

> Well my target is to simply get the 1.3T compressed to 700 Gig so that I can fit it into a single server, while keeping the same level of performance.
Not sure it's going to be possible to get the same performance from one machine as you would from several.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
On 17/08/2011, at 10:24 AM, Yi Yang wrote:

> Thanks Aaron.
> 
>>> 2)
>>> I'm doing batch writes to the database (pulling data from multiple sources and putting it together). I wish to know if there are better methods to improve write efficiency, since it's just about the same speed as MySQL when writing sequentially. It seems the commitlog requires a huge amount of disk I/O, more than my test machine can afford.
>> Have a look at http://www.datastax.com/dev/blog/bulk-loading
> This is a great tool for me. I'll try this tool, since it will require much less bandwidth and disk I/O.
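
The post boils down to writing SSTables locally with SSTableSimpleUnsortedWriter and then streaming them into the cluster with bin/sstableloader, which bypasses the commitlog and the normal write path. A minimal sketch along the lines of its example, against the 0.8 io.sstable API (keyspace, column family, key and column names below are placeholders, and the class and constructor are worth double-checking against the build you run):

    import java.io.File;
    import java.io.IOException;
    import java.nio.ByteBuffer;

    import org.apache.cassandra.db.marshal.AsciiType;
    import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
    import org.apache.cassandra.utils.ByteBufferUtil;

    public class BulkWrite
    {
        public static void main(String[] args) throws IOException
        {
            // Output directory for the generated SSTables; sstableloader
            // expects it to be named after the keyspace.
            File dir = new File("/tmp/Ticks");
            dir.mkdirs();

            // Buffers rows in memory (~64MB here) and flushes a new SSTable
            // whenever the buffer fills, so input need not be pre-sorted.
            SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                    dir, "Ticks", "Series", AsciiType.instance, null, 64);

            long timestamp = System.currentTimeMillis() * 1000; // microseconds

            ByteBuffer value = ByteBuffer.allocate(4);
            value.putFloat(0, 42.5f); // a float stored as its 4 raw bytes

            writer.newRow(ByteBufferUtil.bytes("6fa459ea-ee8a-3ca4-894e-db77e160355e"));
            writer.addColumn(ByteBufferUtil.bytes("p"), value, timestamp);
            writer.close();
        }
    }

Afterwards, bin/sstableloader /tmp/Ticks streams the generated files to the replicas that own them, so the writes never touch the commitlog on the receiving nodes.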


>>> 3)
>>> In my case, each row is read randomly with the same chance. I have around 0.5M rows in total. Can you provide some practical advice on optimizing the row cache and key cache? I can use up to 8 gig of memory on the test machines.
>> Is your data set small enough to fit in memory? You may also be interested in the row_cache_provider setting for column families, see the CLI help for create column family and the IRowCacheProvider interface. You can replace the caching strategy if you want to.
> The dataset is about 150 Gig stored as CSV and an estimated 1.3T stored as SSTables, hence I don't think it can fit into memory. I'll try the caching strategy, but I think it will only improve my case a little.
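
For the cache settings themselves: with roughly 0.5M rows read uniformly at random, the key cache is cheap enough to cover every key, and the row cache is the knob to experiment with inside the 8 gig. A rough starting point in the CLI (the column family name is a placeholder, the numbers are guesses to tune, and SerializingCacheProvider is the off-heap provider that ships with 0.8):

    update column family Series
      with keys_cached = 500000
      and rows_cached = 100000
      and row_cache_provider = 'SerializingCacheProvider';

Watch the key and row cache hit rates in nodetool cfstats and grow rows_cached as memory allows; with uniform random reads the row cache hit rate is roughly the fraction of rows that fit in it, so it only pays off once that fraction is meaningful.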

> I'm now looking into native compression for SSTables. I just patched in CASSANDRA-47 and found there is a huge performance penalty in my use case, and I haven't figured out the reason yet. I suppose CASSANDRA-647 will solve it better, but I see there are a number of tickets working on a similar issue, including CASSANDRA-1608 etc. Is that because Cassandra really costs a huge amount of disk space?
> 
> Well my target is to simply get the 1.3T compressed to 700 Gig so that I can fit it into a single server, while keeping the same level of performance.
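
On the compression side, CASSANDRA-47 makes compression a per column family option, so it is enabled through the schema rather than globally. The CLI form it ends up with looks roughly like this (syntax as it later ships in 1.0; a hand-patched 0.8 build may expose it differently, and the chunk size is only a starting value):

    update column family Series
      with compression_options = {sstable_compression:SnappyCompressor, chunk_length_kb:64};

Only newly written SSTables pick up the setting, so existing data keeps its size until it is compacted or rewritten. Smaller chunks favour random-read latency, larger ones favour compression ratio.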

> Best,
> Steve
> 
> 
> On Aug 16, 2011, at 2:27 PM, aaron morton wrote:
> 
>> Hope that helps.
>> 
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 16/08/2011, at 12:44 PM, Yi Yang wrote:

>>> Dear all,
>>> 
>>> I want to report my use case and have a discussion with you guys.
>>> 
>>> I'm currently working on my second Cassandra project. I got into a somewhat unique use case: storing a traditional, relational data set in the Cassandra datastore. It's a dataset of int and float numbers, no strings and no other data types, and the column names are much longer than the values themselves. Besides, the row key is an MD5-based version 3 UUID of some other data.
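
As an aside, that style of row key needs nothing beyond the JDK: UUID.nameUUIDFromBytes produces exactly a version 3, MD5-based UUID (the input string below is an invented example of the "other data"):

    import java.nio.charset.Charset;
    import java.util.UUID;

    public class RowKeys
    {
        public static void main(String[] args)
        {
            // Version 3 (name-based, MD5) UUID derived from some other data.
            String source = "series-42|2011-08-16"; // invented example input
            UUID rowKey = UUID.nameUUIDFromBytes(source.getBytes(Charset.forName("UTF-8")));
            System.out.println(rowKey + " version=" + rowKey.version()); // prints version=3
        }
    }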

>>> 1)
>>> I did some workarounds to save disk space, however it still takes approximately 12-15x more disk space than MySQL. I looked into the Cassandra SSTable internals, did some optimizing by selecting a better data serializer, and also hashed the column names down to one byte. That leaves the current database with ~6x the disk space overhead compared with MySQL, which I think might be acceptable.
>>> 
>>> I'm currently interested in CASSANDRA-674 and will also test CASSANDRA-47 in the coming days. I'll keep you updated on my testing. But I'm willing to hear your ideas on saving disk space.
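
The one-byte column name trick is the piece that matters most when values are only 4-8 bytes, since every column stores its full name on disk. A minimal sketch of one way to do it, not the project's actual code, with invented field names:

    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.Map;

    // Maps long, human-readable field names to one-byte column names before
    // writing, and back again after reading. The dictionary is small and
    // static, so it can live in application code or its own tiny column family.
    public class ColumnNameCodec
    {
        private final Map<String, Byte> nameToId = new HashMap<String, Byte>();
        private final Map<Byte, String> idToName = new HashMap<Byte, String>();

        public void register(String name, int id)
        {
            if (id < 0 || id > 255)
                throw new IllegalArgumentException("id must fit in one byte: " + id);
            nameToId.put(name, (byte) id);
            idToName.put((byte) id, name);
        }

        // One-byte column name, e.g. "adjusted_closing_price" -> 0x03
        public ByteBuffer encode(String name)
        {
            return ByteBuffer.wrap(new byte[]{ nameToId.get(name) });
        }

        public String decode(ByteBuffer columnName)
        {
            return idToName.get(columnName.get(columnName.position()));
        }
    }

With names shrunk like this, what is left per column is the one byte of name plus Cassandra's fixed per-column metadata (timestamp, flags and length fields), which is presumably a large part of the remaining ~6x over MySQL.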

>>> 2)
>>> I'm doing batch writes to the database (pulling data from multiple sources and putting it together). I wish to know if there are better methods to improve write efficiency, since it's just about the same speed as MySQL when writing sequentially. It seems the commitlog requires a huge amount of disk I/O, more than my test machine can afford.
>>> 
>>> 3)
>>> In my case, each row is read randomly with the same chance. I have around 0.5M rows in total. Can you provide some practical advice on optimizing the row cache and key cache? I can use up to 8 gig of memory on the test machines.
>>> 
>>> Thanks for your help.
>>> 
>>> 
>>> Best,
>>> 
>>> Steve