incubator-cassandra-user mailing list archives

From Julie <julie.su...@nextcentury.com>
Subject Re: Cassandra disk space utilization WAY higher than I would expect
Date Wed, 07 Jul 2010 17:10:59 GMT
Peter Schuller <peter.schuller <at> infidyne.com> writes:

> > Keep in mind that there is additional data storage overhead, including 
> > timestamps and column names. Because the schema can vary from row to row, 
> > the column names are stored with each row, in addition to the data. Disk
> > space-efficiency is not a primary design goal for Cassandra.
> 
> If the rows that are 200k (or was it 100k) are not single columns but
> rather lots and lots of smaller columns, then this will be
> significant.
> 
> In addition, during compaction a column family can use up to twice
> its usual amount of disk (during a major compaction all data will at
> some point exist in duplicate).

I would expect the timestamps and column names to be included in the 
column family stats, which report 300,000 rows of 100 KB each, or 30 GB. 
Each of my rows has only 1 column, so there should be only one timestamp 
per row, and my column name is only 10 bytes long.

So that overhead doesn't explain why 30 GB of data is taking up 106 GB of 
disk 24 hours after all writes have completed.  Compactions should be 
complete by now, no?
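
For what it's worth, here is the back-of-envelope arithmetic behind those 
numbers as a small Python sketch. The 8-byte timestamp is an assumption (a 
64-bit value per column), the sizes use decimal units, and the function and 
constant names are purely illustrative. Any other per-column or per-row 
bookkeeping is ignored.

```python
# Rough disk-usage estimate for the column family described in this thread.
ROWS = 300_000
VALUE_BYTES = 100 * 1000        # 100 KB per row, one column per row
COLUMN_NAME_BYTES = 10          # column name length from the post
TIMESTAMP_BYTES = 8             # assumption: one 64-bit timestamp per column
OBSERVED_BYTES = 106 * 10**9    # 106 GB seen on disk


def estimate(rows, value_bytes, name_bytes, timestamp_bytes):
    """Return (raw data bytes, name + timestamp overhead bytes)."""
    raw = rows * value_bytes
    overhead = rows * (name_bytes + timestamp_bytes)
    return raw, overhead


raw, overhead = estimate(ROWS, VALUE_BYTES, COLUMN_NAME_BYTES, TIMESTAMP_BYTES)
compaction_peak = 2 * (raw + overhead)   # worst case: every byte duplicated once
ratio = OBSERVED_BYTES / raw

print(f"raw data:          {raw / 1e9:.1f} GB")
print(f"name + timestamp:  {overhead / 1e6:.1f} MB")
print(f"compaction peak:   {compaction_peak / 1e9:.1f} GB")
print(f"observed / raw:    {ratio:.2f}x")
```

Under these assumptions the name and timestamp overhead comes to a few 
megabytes, and even a full duplicate during compaction would stay well 
under the 106 GB observed.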

Julie