cassandra-user mailing list archives

From Yi Yang <yy...@me.com>
Subject Re: Cassandra for numerical data set
Date Tue, 16 Aug 2011 22:24:43 GMT

Thanks Aaron.

>> 2)
>> I'm doing batch writes to the database (pulling data from multiple sources and
>> putting it together).   I'd like to know if there are better methods to improve write
>> efficiency, since it's about the same speed as MySQL when writing sequentially.   It seems
>> the commitlog requires more disk IO than my test machine can afford.
> Have a look at http://www.datastax.com/dev/blog/bulk-loading
This looks like a great tool for me.   I'll give it a try, since it should require much less
bandwidth and disk IO.
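
For reference, here's roughly what I plan to try, following that blog post: write SSTables
locally with SSTableSimpleUnsortedWriter, then stream them into the cluster with sstableloader.
Just a minimal sketch against the 0.8 API as I understand it; the keyspace/column family names,
output path, and row key are placeholders for my setup:

    import java.io.File;
    import java.nio.ByteBuffer;
    import org.apache.cassandra.db.marshal.BytesType;
    import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
    import org.apache.cassandra.utils.ByteBufferUtil;

    public class BulkWriter {
        public static void main(String[] args) throws Exception {
            // Writes SSTables straight to local disk, bypassing the commitlog entirely.
            // "MyKeyspace"/"Data" and the path are placeholders.
            SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                    new File("/tmp/bulk/MyKeyspace/Data"),  // output directory (must exist)
                    "MyKeyspace", "Data",
                    BytesType.instance,                     // column name comparator
                    null,                                   // not a super column family
                    64);                                    // in-memory buffer size, in MB

            long timestamp = System.currentTimeMillis() * 1000;
            writer.newRow(ByteBufferUtil.bytes("row-key-uuid"));  // my v3 UUID row key
            // one-byte column name, 4-byte float value
            writer.addColumn(ByteBuffer.wrap(new byte[]{0x01}),
                             ByteBufferUtil.bytes(3.14f), timestamp);
            writer.close();
            // then: bin/sstableloader /tmp/bulk/MyKeyspace/Data
        }
    }

The appealing part is that the writer buffers and sorts in memory and only does sequential IO,
which should be much kinder to my disks than going through the commitlog.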

> 
>> 3)
>> In my case, each row is read randomly with the same chance.   I have around 0.5M
>> rows in total.   Can you give some practical advice on optimizing the row cache and key
>> cache?   I can use up to 8 gig of memory on the test machines.
> Is your data set small enough to fit in memory? You may also be interested in the
> row_cache_provider setting for column families; see the CLI help for create column family
> and the IRowCacheProvider interface. You can replace the caching strategy if you want to.
 
The dataset is about 150 gig stored as CSV, and an estimated 1.3T stored as SSTables, so I
don't think it can fit into memory.   I'll experiment with the caching strategy, but I expect
it will only improve my case a little.
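
To put numbers on it: 1.3T over 0.5M rows is roughly 2.6 MB per row on average, so 8 gig of
row cache would hold only about 3,000 rows, i.e. ~0.6% of the data.   With uniformly random
reads that means a ~0.6% row cache hit rate, so I'd expect the key cache (and the OS page
cache) to matter much more than the row cache here.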

I'm now looking into native compression of SSTables.   I just applied the CASSANDRA-47 patch
and found a huge performance penalty in my use case, and I haven't figured out the reason
yet.   I suppose CASSANDRA-647 will solve it better; however, I see there are a number of
tickets working on a similar issue, including CASSANDRA-1608 etc.   Is that because Cassandra
really does cost a lot of disk space?

Well, my target is simply to get the 1.3T compressed down to 700 gig so that I can fit it
onto a single server, while keeping the same level of performance.
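
That works out to a required compression ratio of about 1300/700, i.e. roughly 1.9:1, or a
~46% reduction in on-disk size.   Given the data is all ints and floats with one-byte column
names, block compression of that order seems plausible, but I'll have to measure it.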

Best,
Steve


On Aug 16, 2011, at 2:27 PM, aaron morton wrote:

>> 
> 
> Hope that helps. 
> 
>  
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 16/08/2011, at 12:44 PM, Yi Yang wrote:
> 
>> Dear all,
>> 
>> I want to report my use case, and have a discussion with you guys.
>> 
>> I'm currently working on my second Cassandra project.   I've run into a somewhat unique
>> use case: storing a traditional, relational data set in a Cassandra datastore.   It's a
>> dataset of int and float numbers, no strings and no other data, and the column names are
>> much longer than the values themselves.   Also, the row key is a version-3 (MD5) UUID of
>> some other data.
>> 
>> 1)
>> I did some workarounds to save disk space, but it still takes approximately 12-15x more
>> disk space than MySQL.   I looked into the Cassandra SSTable internals, did some optimizing
>> by selecting a better data serializer, and also hashed each column name down to one byte.
>> That leaves the current database with ~6x disk-space overhead compared to MySQL, which
>> might be acceptable.
>> 
>> I'm currently interested in CASSANDRA-674 and will also test CASSANDRA-47 in the coming
>> days.   I'll keep you updated on my testing, but I'd be glad to hear your ideas on saving
>> disk space.
>> 
>> 2)
>> I'm doing batch writes to the database (pulling data from multiple sources and
>> putting it together).   I'd like to know if there are better methods to improve write
>> efficiency, since it's about the same speed as MySQL when writing sequentially.   It seems
>> the commitlog requires more disk IO than my test machine can afford.
>> 
>> 3)
>> In my case, each row is read randomly with the same chance.   I have around 0.5M
>> rows in total.   Can you give some practical advice on optimizing the row cache and key
>> cache?   I can use up to 8 gig of memory on the test machines.
>> 
>> Thanks for your help.
>> 
>> 
>> Best,
>> 
>> Steve
>> 
>> 
> 

