cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Cassandra for numerical data set
Date Tue, 16 Aug 2011 21:27:16 GMT
> 
> 2)
> I'm doing batch writes to the database (pulling data from multiple sources and putting
> them together). I'd like to know whether there are better methods to improve write
> efficiency, since it's about the same speed as MySQL when writing sequentially. It seems
> the commitlog requires far more disk I/O than my test machine can afford.
Have a look at http://www.datastax.com/dev/blog/bulk-loading
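For what it's worth, here is a rough sketch along the lines of that post, assuming the SSTableSimpleUnsortedWriter class shipped with Cassandra 0.8; the path, keyspace, column family, row key and column values below are placeholders. You build the sstables on the client side and then stream them in with bin/sstableloader, which bypasses the commitlog entirely:

    import java.io.File;
    import java.nio.ByteBuffer;
    import org.apache.cassandra.db.marshal.AsciiType;
    import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
    import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

    public class BulkWrite {
        public static void main(String[] args) throws Exception {
            // Build sstables locally, buffering rows in memory and flushing a
            // new sstable roughly every 64 MB.
            SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                    new File("/tmp/MyKeyspace/MyCF"),  // output directory (placeholder)
                    "MyKeyspace", "MyCF",              // keyspace / column family (placeholders)
                    AsciiType.instance,                // column name comparator
                    null,                              // no sub-comparator (standard CF)
                    64);                               // MB to buffer before flushing

            long timestamp = System.currentTimeMillis() * 1000; // microseconds
            writer.newRow(bytes("row-key-1"));         // placeholder row key
            writer.addColumn(bytes("c1"),              // short/hashed column name
                    ByteBuffer.allocate(8).putDouble(0, 1.5), timestamp);
            writer.close();
            // afterwards: bin/sstableloader /tmp/MyKeyspace/MyCF
        }
    }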

> 3)
> In my case, each row is read randomly with equal probability. I have around 0.5M rows in
> total. Can you provide some practical advice on optimizing the row cache and key cache?
> I can use up to 8 GB of memory on the test machines.
Is your data set small enough to fit in memory? You may also be interested in the row_cache_provider
setting for column families; see the CLI help for create column family and the IRowCacheProvider
interface. You can replace the caching strategy if you want to.
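With ~0.5M rows read uniformly and 8 GB to play with, one option is to cache all keys plus a chunk of rows and measure from there. Something like the following at the CLI, as a sketch: the keyspace and column family names are placeholders and the numbers are only a starting point, not a recommendation. SerializingCacheProvider keeps cached rows serialized off-heap; the default is ConcurrentLinkedHashCacheProvider.

    [default@MyKeyspace] update column family Data
        with keys_cached = 500000
        and rows_cached = 200000
        and row_cache_provider = 'SerializingCacheProvider';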

Hope that helps. 

 
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 16/08/2011, at 12:44 PM, Yi Yang wrote:

> Dear all,
> 
> I'd like to report my use case and have a discussion with you all.
> 
> I'm currently working on my second Cassandra project. I've run into a somewhat unique use
> case: storing a traditional, relational data set in the Cassandra datastore. It's a dataset
> of int and float numbers, with no strings or other data, and the column names are much
> longer than the values themselves. Also, the row key is the MD5-based version 3 UUID of
> some other data.
> 
> 1)
> I did some workarounds to save disk space, but it still takes approximately 12-15x more
> disk space than MySQL. I looked into the Cassandra SSTable internals, did some optimization
> by selecting a better data serializer, and also hashed the column names down to one byte.
> That brought the current database down to ~6x disk-space overhead compared with MySQL,
> which I think might be acceptable.
> 
> I'm currently interested in CASSANDRA-674 and will also test CASSANDRA-47 in the coming
> days. I'll keep you updated on my testing, but I'd be glad to hear your ideas on saving
> disk space.
> 
> 2)
> I'm doing batch writes to the database (pulling data from multiple sources and putting
> them together). I'd like to know whether there are better methods to improve write
> efficiency, since it's about the same speed as MySQL when writing sequentially. It seems
> the commitlog requires far more disk I/O than my test machine can afford.
> 
> 3)
> In my case, each row is read randomly with equal probability. I have around 0.5M rows in
> total. Can you provide some practical advice on optimizing the row cache and key cache?
> I can use up to 8 GB of memory on the test machines.
> 
> Thanks for your help.
> 
> 
> Best,
> 
> Steve
> 
> 

