From: Yi Yang <yyang@me.com>
Subject: Re: Cassandra for numerical data set
Date: Tue, 16 Aug 2011 16:52:15 -0700
To: user@cassandra.apache.org

BTW, if I'm going to insert a super column family (SCF) row with ~400 columns and ~50 subcolumns under each column, how often should I send a mutation: once per column, or once per row?
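For instance, would one batch per row, along these lines, be the right granularity? (A rough Hector sketch; the cluster address, keyspace, column family, and serializers are placeholders for my real schema.)

    import java.util.ArrayList;
    import java.util.List;

    import me.prettyprint.cassandra.serializers.FloatSerializer;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class RowBatch {
        public static void main(String[] args) {
            Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "localhost:9160");
            Keyspace ks = HFactory.createKeyspace("MyKeyspace", cluster);  // placeholder keyspace
            StringSerializer ss = StringSerializer.get();
            FloatSerializer fs = FloatSerializer.get();

            // Queue every super column of the row on one Mutator, then send a
            // single batch_mutate for the whole row (400 x 50 = 20,000 subcolumns).
            Mutator<String> mutator = HFactory.createMutator(ks, ss);
            for (int c = 0; c < 400; c++) {
                List<HColumn<String, Float>> subs = new ArrayList<HColumn<String, Float>>(50);
                for (int s = 0; s < 50; s++) {
                    subs.add(HFactory.createColumn("sub" + s, (float) s, ss, fs));
                }
                mutator.addInsertion("row-key", "MySuperCF",
                        HFactory.createSuperColumn("col" + c, subs, ss, ss, fs));
            }
            mutator.execute();  // one round trip instead of 400 (or 20,000)
        }
    }

Or is a 20,000-subcolumn batch too large for a single batch_mutate, so I should split each row into a few chunks?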

On Aug 16, 2011, at 3:24 PM, Yi Yang wrote:

> Thanks Aaron.

>>> 2)
>>> I'm doing batch writes to the database (pulling data from multiple sources and putting it together). I'd like to know whether there are better ways to improve write efficiency, since it's currently about the same speed as MySQL when writing sequentially. It seems the commitlog requires a huge amount of disk I/O compared with what my test machine can afford.
>> Have a look at http://www.datastax.com/dev/blog/bulk-loading
> This is a great tool for me. I'll try it, since it should cost much less bandwidth and disk I/O.
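Concretely, I'm planning to follow that post: write SSTables locally with SSTableSimpleUnsortedWriter, then stream them in with sstableloader. A rough sketch of my writer, with placeholder keyspace/CF names and a single float column standing in for my real schema:

    import java.io.File;
    import java.nio.ByteBuffer;

    import org.apache.cassandra.db.marshal.AsciiType;
    import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
    import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

    public class BulkWriter {
        public static void main(String[] args) throws Exception {
            // Buffers ~64 MB of mutations in memory, sorts them, and flushes
            // complete SSTables into a directory named after the keyspace.
            SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                    new File("/tmp/MyKeyspace"),   // output dir (placeholder)
                    "MyKeyspace", "MyCF",          // keyspace / column family (placeholders)
                    AsciiType.instance,            // column comparator
                    null,                          // no subcomparator (standard CF)
                    64);                           // buffer size in MB

            long timestamp = System.currentTimeMillis() * 1000;  // microseconds
            for (int i = 0; i < 500000; i++) {
                writer.newRow(bytes("row" + i));
                writer.addColumn(bytes("v"), ByteBuffer.allocate(4).putFloat(0, 1.0f), timestamp);
            }
            writer.close();
            // Then, from a machine that can see the cluster:  bin/sstableloader /tmp/MyKeyspace
        }
    }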


>>> 3)
>>> In my case, each row is read randomly with equal probability. I have around 0.5M rows in total. Can you give some practical advice on optimizing the row cache and key cache? I can use up to 8 GB of memory on the test machines.
>> Is your data set small enough to fit in memory? You may also be interested in the row_cache_provider setting for column families; see the CLI help for "create column family" and the IRowCacheProvider interface. You can replace the caching strategy if you want to.
> The dataset is about 150 GB stored as CSV, and an estimated 1.3 TB stored as SSTables, so I don't think it can fit into memory. I'll experiment with the caching strategy, but I expect it to improve my case only a little.
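In the meantime I'll start with the per-column-family cache settings; something like this from cassandra-cli (the CF name and sizes are placeholders; since my ~0.5M rows are read uniformly at random, caching all 0.5M keys is cheap, while the row cache hit rate is roughly just the fraction of rows that fits in it):

    update column family MyCF with keys_cached = 500000 and rows_cached = 50000;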

> I'm now looking into native compression for SSTables. I just applied the CASSANDRA-47 patch and found a huge performance penalty in my use case, which I haven't figured out yet. I suppose CASSANDRA-674 will solve it better, though I see a number of tickets working on similar issues, including CASSANDRA-1608 and others. Is that because Cassandra really does cost this much disk space?

> Well, my target is simply to get the 1.3 TB compressed down to 700 GB (roughly a 2:1 ratio) so that I can fit it on a single server, while keeping the same level of performance.

> Best,
> Steve


> On Aug 16, 2011, at 2:27 PM, aaron morton wrote:



>> Hope that helps.
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com

>> On 16/08/2011, at 12:44 PM, Yi Yang wrote:

>>> Dear all,

>>> I want to report my use case and have a discussion with you all.

>>> I'm currently working on my second Cassandra project, and I've got a somewhat unusual use case: storing a traditional, relational data set in Cassandra. It's a dataset of int and float numbers, with no strings or other data, and the column names are much longer than the values themselves. Besides, the row key is a version-3 (MD5-hashed) UUID of some other data.
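To be concrete about the keys: a version-3 UUID is name-based and MD5-hashed, so in Java it comes straight from the standard library (the input string below is a placeholder):

    import java.nio.charset.Charset;
    import java.util.UUID;

    public class RowKeys {
        public static void main(String[] args) {
            // nameUUIDFromBytes builds a name-based, MD5-hashed (version 3) UUID,
            // so the same source record always maps to the same row key.
            byte[] source = "source-record-identity".getBytes(Charset.forName("UTF-8"));
            UUID rowKey = UUID.nameUUIDFromBytes(source);
            System.out.println(rowKey + "  (version " + rowKey.version() + ")");
        }
    }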

>>> 1)
>>> I did some workarounds to save disk space, but it still took approximately 12-15x more disk space than MySQL. I looked into the SSTable internals, did some optimization by selecting a better data serializer, and also hashed each column name down to one byte. That brought the current database to ~6x the disk overhead of MySQL, which I think might be acceptable.
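The one-byte column names are really a fixed lookup table shared by writers and readers rather than a true hash (a real one-byte hash would collide); a minimal sketch of the idea, with made-up names:

    import java.util.HashMap;
    import java.util.Map;

    public class ColumnNameCodec {
        // Fixed table mapping long, human-readable column names to one-byte codes.
        // One byte caps the schema at 256 distinct column names.
        private static final String[] NAMES = {
            "some_very_long_descriptive_metric_name_a",
            "some_very_long_descriptive_metric_name_b"
            // ... up to 256 entries, in a fixed order
        };
        private static final Map<String, Byte> CODES = new HashMap<String, Byte>();
        static {
            for (int i = 0; i < NAMES.length; i++) {
                CODES.put(NAMES[i], (byte) i);
            }
        }

        public static byte[] encode(String name) {
            return new byte[] { CODES.get(name) };  // the one-byte on-disk column name
        }

        public static String decode(byte[] column) {
            return NAMES[column[0] & 0xff];
        }
    }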

>>> I'm currently interested in CASSANDRA-674 and will also test CASSANDRA-47 in the coming days. I'll keep you updated on my testing, but I'd be glad to hear your ideas on saving disk space.

>>> 2)
>>> I'm doing batch writes to the database (pulling data from multiple sources and putting it together). I'd like to know whether there are better ways to improve write efficiency, since it's currently about the same speed as MySQL when writing sequentially. It seems the commitlog requires a huge amount of disk I/O compared with what my test machine can afford.

>>> 3)
>>> In my case, each row is read randomly with equal probability. I have around 0.5M rows in total. Can you give some practical advice on optimizing the row cache and key cache? I can use up to 8 GB of memory on the test machines.

>>> Thanks for your help.


>>> Best,

>>> Steve



