incubator-cassandra-user mailing list archives

From Yi Yang <>
Subject Cassandra for numerical data set
Date Tue, 16 Aug 2011 00:44:52 GMT
Dear all,

I'd like to report my use case and discuss it with you.

I'm currently working on my second Cassandra project. I've run into a somewhat unusual use case:
storing a traditional, relational data set in Cassandra. The data set consists only of int and
float numbers, with no strings or other data types, and the column names are much longer than
the values themselves. In addition, the row key is a version 3 (MD5-based) UUID derived from
some other data.
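
As an illustration of the row-key scheme described above, here is a minimal Python sketch of
deriving a version 3 (MD5-based) UUID and storing it as 16 raw bytes rather than as its
36-character text form. The namespace and the source string are assumptions; the message
does not say what data the key is derived from.

```python
import uuid

# Assumption: the message does not name a namespace, so a standard one is used here.
NAMESPACE = uuid.NAMESPACE_DNS


def row_key(source: str) -> bytes:
    """Return the 16-byte version 3 (MD5-based) UUID for a source string."""
    return uuid.uuid3(NAMESPACE, source).bytes


# Hypothetical source value, for illustration only.
key = row_key("example-source-record")
print(len(key))                      # size of the raw key in bytes
print(len(str(uuid.UUID(bytes=key))))  # size of the same key as text
```

The same input always yields the same key, so the key can be recomputed from the source data
instead of being looked up.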

I applied some workarounds to save disk space, but the data still took approximately
12-15x more disk space than in MySQL. I then looked into the SSTable internals, chose more
compact data serializers, and hashed each column name into one byte. That brought the
current database down to ~6x disk space overhead compared with MySQL, which I think might
be acceptable.
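
The column-name shortening and compact serialization described above can be sketched roughly
as follows. This is only an illustration under stated assumptions: the column names are
hypothetical, and a fixed lookup table is used in place of a hash so that two names can never
collide on the same byte.

```python
import struct

# Hypothetical column names; the real schema is not given in the message.
COLUMN_NAMES = [
    "average_daily_temperature",
    "total_rainfall_mm",
    "station_elevation_m",
]

# Map each long column name to a one-byte code (at most 256 distinct columns).
NAME_TO_CODE = {name: bytes([i]) for i, name in enumerate(COLUMN_NAMES)}
CODE_TO_NAME = {code: name for name, code in NAME_TO_CODE.items()}


def encode_column(name, value):
    """Return (one-byte column code, 4-byte big-endian value)."""
    code = NAME_TO_CODE[name]
    if isinstance(value, int):
        payload = struct.pack(">i", value)  # 32-bit signed integer
    else:
        payload = struct.pack(">f", value)  # 32-bit IEEE 754 float
    return code, payload


code, payload = encode_column("total_rainfall_mm", 42)
print(len(code) + len(payload))  # bytes used for name plus value
```

With this layout each column costs 5 bytes of name and value instead of a long name plus a
text-encoded number; any per-column metadata that Cassandra adds on disk comes on top of that.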

I'm currently interested in CASSANDRA-674 and will also test CASSANDRA-47 in the coming
days. I'll keep you updated on my testing, but I'd also like to hear your ideas on saving
disk space.

I'm doing batch writes to the database (pulling data from multiple sources and putting it
together). I'd like to know whether there are better ways to improve write efficiency,
since it is currently about the same speed as MySQL when writing sequentially. It seems that
the commitlog requires a huge amount of disk I/O, more than my test machine can afford.

In my case, every row is read randomly with equal probability. I have around 0.5M rows in
total. Can you offer some practical advice on tuning the row cache and key cache? I can
use up to 8 GB of memory on the test machines.
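
For a uniformly random read pattern like the one described above, the expected cache hit rate
is roughly the fraction of rows that fit in the cache. The following Python sketch does that
back-of-the-envelope arithmetic. The memory budget for caching and the per-row size are
assumptions for illustration, not measured values.

```python
TOTAL_ROWS = 500_000                 # "around 0.5M rows" from the message
CACHE_BUDGET_BYTES = 4 * 1024**3     # assumption: half of the 8 GB goes to caching
BYTES_PER_CACHED_ROW = 2048          # hypothetical size of one cached row, with overhead


def rows_that_fit(budget_bytes, bytes_per_row):
    """Number of whole rows that fit in the given memory budget."""
    return budget_bytes // bytes_per_row


def expected_hit_rate(cached_rows, total_rows):
    """With uniform random reads, the hit rate is the cached fraction of rows."""
    return min(cached_rows, total_rows) / total_rows


cached = rows_that_fit(CACHE_BUDGET_BYTES, BYTES_PER_CACHED_ROW)
print(cached, expected_hit_rate(cached, TOTAL_ROWS))
```

Under these assumed numbers the whole data set fits in the cache. If the real per-row size is
much larger, the same functions show how quickly the hit rate drops.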

Thanks for your help.


