When the data is streamed into the cluster by the bulk loader, it is compressed on the receiving end (if the target CF has compression enabled).

If you are able to reproduce this, can you create a ticket on https://issues.apache.org/jira/browse/CASSANDRA?


Aaron Morton
Freelance Developer

On 28/06/2012, at 10:00 PM, Andy Cobley wrote:

My (limited) experience of moving from 0.8 to 1.0 is that you do have to use rebuildsstables. I'm guessing BulkLoading is bypassing the compression?


On 28 Jun 2012, at 10:53, jmodha wrote:


We are migrating our Cassandra cluster from v1.0.3 to v1.1.1; the data is
migrated using SSTableLoader to an empty Cassandra cluster.

The data in the source cluster (v1.0.3) is uncompressed and the target
cluster (1.1.1) has the column family created with compression turned on.

What we are seeing is that once the data has been loaded into the target
cluster, the size is similar to the data in the source cluster. Our
expectation is that since we have turned on compression in the target
cluster, the amount of data would be reduced.

We have tried running the "rebuildsstables" nodetool command on a node after
data has been loaded and we do indeed see a huge reduction in size e.g. from
30GB to 10GB for a given column family. We were hoping to see this at the
point of loading the data in via the SSTableLoader.

Is this behaviour expected?

Do we need to run the rebuildsstables command on all nodes to actually
compress the data after it has been streamed in?
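
For anyone wanting to script the per-node step asked about above, here is a rough Python sketch that builds one nodetool invocation per node and, by default, only prints them. The subcommand name is copied verbatim from this message and has not been checked against any nodetool release, so confirm it against nodetool's own help output first. The host addresses, keyspace and column family names are hypothetical placeholders, and passing the keyspace and column family as trailing arguments is an assumption.

```python
import subprocess
from typing import List

# Subcommand name copied from the message above; NOT verified against any
# nodetool release. Check nodetool's help output for the name your version uses.
SUBCOMMAND = "rebuildsstables"


def build_commands(hosts: List[str], keyspace: str, column_family: str) -> List[List[str]]:
    """Return one nodetool argv per host. Nothing is executed here."""
    # Assumption: the subcommand accepts keyspace and column family arguments.
    return [
        ["nodetool", "-h", host, SUBCOMMAND, keyspace, column_family]
        for host in hosts
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, or run them one node at a time if dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops at the first node where nodetool reports failure.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical node addresses, keyspace and column family.
    cmds = build_commands(["10.0.0.1", "10.0.0.2"], "MyKeyspace", "MyCF")
    run_all(cmds)  # dry run: prints the commands only
```

Running the nodes one after another rather than in parallel is a deliberate choice in this sketch, since rewriting SSTables is I/O heavy.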


View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/BulkLoading-SSTables-and-compression-tp7580849.html