cassandra-user mailing list archives

From Yi Yang <yy...@me.com>
Subject Re: Cassandra for numerical data set
Date Tue, 16 Aug 2011 23:52:15 GMT
BTW,
If I'm going to insert a super column family (SCF) row with ~400 super columns and ~50 subcolumns under each, how should I group the mutations: one mutation per column, or one per row?
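
To make the question concrete, here is the kind of per-row batch I have in mind (just a sketch using the Hector client; the CF and column names are made up):

import java.util.ArrayList;
import java.util.List;

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class RowBatchWrite {

    // Queue all ~400 super columns of one row on a single Mutator and send
    // them in one batch_mutate round trip, instead of one insert per column.
    static void writeRow(Keyspace keyspace, String rowKey) {
        StringSerializer ss = StringSerializer.get();
        Mutator<String> mutator = HFactory.createMutator(keyspace, ss);
        for (int sc = 0; sc < 400; sc++) {
            List<HColumn<String, String>> subColumns = new ArrayList<HColumn<String, String>>();
            for (int sub = 0; sub < 50; sub++) {
                subColumns.add(HFactory.createColumn("m" + sub, "42", ss, ss));
            }
            mutator.addInsertion(rowKey, "MetricsSCF",
                    HFactory.createSuperColumn("sc" + sc, subColumns, ss, ss, ss));
        }
        mutator.execute();  // the whole row goes out as one mutation batch
    }
}

That is ~20,000 subcolumns in one call, which is why I'm not sure whether per-row batching is too coarse.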


On Aug 16, 2011, at 3:24 PM, Yi Yang wrote:

> 
> Thanks Aaron.
> 
>>> 2)
>>> I'm doing batch writes to the database (pulling data from multiple sources and putting them together). I'd like to know whether there are better ways to improve write throughput, since sequential writes are currently about the same speed as MySQL. It seems the commit log requires more disk I/O than my test machine can afford.
>> Have a look at http://www.datastax.com/dev/blog/bulk-loading
> This looks like a great tool for my case. I'll try it, since it should require much less bandwidth and disk I/O.
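
For reference, here is roughly what I understand the blog post's approach to be (a sketch only, which I haven't run yet; the keyspace/CF names and paths are placeholders):

import java.io.File;

import org.apache.cassandra.db.marshal.AsciiType;
import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;

import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

public class BulkWrite {
    public static void main(String[] args) throws Exception {
        long timestamp = System.currentTimeMillis() * 1000;
        // Write SSTables locally, bypassing the commit log entirely, then
        // stream them into the cluster with the sstableloader tool.
        SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                new File("/tmp/sstables/"),  // output directory
                "MyKeyspace", "Metrics",
                AsciiType.instance, null,
                64);                         // MB buffered before flushing an sstable
        writer.newRow(bytes("row-key-1"));
        writer.addColumn(bytes("m1"), bytes("42"), timestamp);
        writer.close();
    }
}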
> 
>> 
>>> 3)
>>> In my case, every row is read randomly with equal probability, and I have around 0.5M rows in total. Can you offer some practical advice on tuning the row cache and key cache? I can use up to 8 GB of memory on the test machines.
>> Is your data set small enough to fit in memory? You may also be interested in the row_cache_provider setting for column families; see the CLI help for create column family and the IRowCacheProvider interface. You can replace the caching strategy if you want to.
 
> The dataset is about 150 GB stored as CSV and an estimated 1.3 TB stored as SSTables, so I don't think it can fit into memory. I'll experiment with the caching strategy, but I expect it will only improve my case a little.
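
If I do experiment with the caches, I'm thinking of something along these lines over the Thrift API (a sketch; the cache sizes are arbitrary and the serializing provider class name is my assumption from the CLI help):

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfDef;
import org.apache.cassandra.thrift.KsDef;

public class CacheTuning {

    // Raise the key cache, add a modest row cache, and switch the row cache
    // to the off-heap serializing provider for one column family.
    static void tuneCaches(Cassandra.Client client) throws Exception {
        client.set_keyspace("MyKeyspace");
        KsDef ks = client.describe_keyspace("MyKeyspace");
        for (CfDef cf : ks.getCf_defs()) {
            if (!"Metrics".equals(cf.getName())) {
                continue;
            }
            cf.setKey_cache_size(500000);   // keys; cheap, covers all 0.5M rows
            cf.setRow_cache_size(100000);   // rows; only helps if hot rows fit
            // Assumed available in this build; the off-heap cache needs JNA.
            cf.setRow_cache_provider("org.apache.cassandra.cache.SerializingCacheProvider");
            client.system_update_column_family(cf);
        }
    }
}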
> 
> I'm now looking into native compression of SSTables. I just applied the CASSANDRA-47 patch and found a huge performance penalty in my use case, and I haven't figured out the reason yet. I suppose CASSANDRA-647 will handle it better, though I see there are a number of tickets working on similar issues, including CASSANDRA-1608. Is that because Cassandra really does use a huge amount of disk space?
> 
> Well, my goal is simply to get the 1.3 TB compressed down to 700 GB so that I can fit it on a single server, while keeping the same level of performance.
> 
> Best,
> Steve
> 
> 
> On Aug 16, 2011, at 2:27 PM, aaron morton wrote:
> 
>>> 
>> 
>> Hope that helps. 
>> 
>>  
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 16/08/2011, at 12:44 PM, Yi Yang wrote:
>> 
>>> Dear all,
>>> 
>>> I'd like to report my use case and discuss it with you all.
>>> 
>>> I'm currently working on my second Cassandra project, and I've run into a somewhat unique use case: storing a traditional, relational data set in Cassandra. The data set consists only of int and float numbers, with no strings or other data, and the column names are much longer than the values themselves. Also, the row key is the MD5-based version-3 UUID of some other data.
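
(For what it's worth, the row key is generated along these lines; sourceData stands in for the other data I mentioned:)

import java.nio.charset.Charset;
import java.util.UUID;

public class RowKeys {
    // A version-3 (MD5-based) UUID derived from some other data; the UUID
    // string then serves as the row key.
    static String rowKey(String sourceData) {
        UUID v3 = UUID.nameUUIDFromBytes(sourceData.getBytes(Charset.forName("UTF-8")));
        return v3.toString();
    }
}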
>>> 
>>> 1)
>>> I did some workarounds to save disk space, but it still takes approximately 12-15x more disk space than MySQL. I looked into the Cassandra SSTable internals, optimized by choosing a better data serializer, and also hashed each column name down to one byte. That brought the current database to roughly 6x disk-space overhead compared with MySQL, which I think might be acceptable.
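
To clarify what I mean by one-byte column names and a tighter serializer, the packing is roughly of this kind (a simplified sketch, not my exact code):

import java.nio.ByteBuffer;

public class CompactColumns {

    // Single-byte column names instead of the long original field names;
    // a small lookup table maps field name -> id (0..255) elsewhere.
    static ByteBuffer columnName(int fieldId) {
        return ByteBuffer.wrap(new byte[] { (byte) fieldId });
    }

    // Fixed-width binary values instead of stringified numbers: 4 bytes
    // per int or float rather than their decimal representations.
    static ByteBuffer packInt(int v) {
        return (ByteBuffer) ByteBuffer.allocate(4).putInt(v).flip();
    }

    static ByteBuffer packFloat(float v) {
        return (ByteBuffer) ByteBuffer.allocate(4).putFloat(v).flip();
    }
}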
>>> 
>>> I'm currently interested in CASSANDRA-674 and will also test CASSANDRA-47 in the coming days. I'll keep you updated on my testing, but I'd be glad to hear your ideas on saving disk space.
>>> 
>>> 2)
>>> I'm doing batch writes to the database (pulling data from multiple sources and putting them together). I'd like to know whether there are better ways to improve write throughput, since sequential writes are currently about the same speed as MySQL. It seems the commit log requires more disk I/O than my test machine can afford.
>>> 
>>> 3)
>>> In my case, every row is read randomly with equal probability, and I have around 0.5M rows in total. Can you offer some practical advice on tuning the row cache and key cache? I can use up to 8 GB of memory on the test machines.
>>> 
>>> Thanks for your help.
>>> 
>>> 
>>> Best,
>>> 
>>> Steve
>>> 
>>> 
>> 
> 

