On Mon, Feb 22, 2010 at 1:40 PM, Sonny Heer <sonnyheer@gmail.com> wrote:
Hey,

We are in the process of implementing a Cassandra application service.

We have already ingested TB of data using the Cassandra bulk loader (StorageService).

One of the requirements is to get a data explosion factor as a result of denormalization.  Since the writes are going to the memory tables, I'm not sure how I could grab stats.  I can't get the size of the data before ingest, since some of the data may be duplicated.

Are you talking about duplication across nodes due to the replication factor, or because some rows may still be in the memtable? 
 
I think what you want to do is run bin/nodeprobe flush, then bin/nodeprobe compact, wait until the system is idle, and then sum the sizes of all files in your data paths whose names start with the name of your column family.
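
For the last step, something like the following might do as a rough sketch. It walks each data directory and adds up the sizes of files whose names start with the column family name. The function name and the example paths are made up for illustration, so substitute whatever your data file directories are actually set to, and run it only after the flush and compaction have finished.

```python
import os


def column_family_bytes(data_dirs, cf_name):
    """Sum the sizes (in bytes) of all files under data_dirs whose
    file names start with cf_name.

    data_dirs and cf_name are supplied by the caller; nothing here
    reads the Cassandra configuration.
    """
    total = 0
    for data_dir in data_dirs:
        for root, _dirs, files in os.walk(data_dir):
            for name in files:
                # Plain prefix match, as described above. If one column
                # family name is a prefix of another, both are counted.
                if name.startswith(cf_name):
                    total += os.path.getsize(os.path.join(root, name))
    return total


# Hypothetical usage; the path and column family name are placeholders:
# print(column_family_bytes(["/var/lib/cassandra/data"], "Standard1"))
```

Run this on every node and add the results together to get the cluster-wide on-disk size.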

Another general problem we are running into is finding an easy way to page over the data set (not just rows but columns).  It looks like the API currently supports a count, but no offset.

Columns can easily be paginated via the 'start' and 'finish' parameters.  You can't jump to a random page, but you can provide next/previous behavior.
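
To make the next-page idea concrete, here is a small sketch that runs against an in-memory sorted list rather than a real cluster. get_slice below is a hypothetical stand-in for a slice query, not the client API; it assumes 'start' and 'finish' are inclusive bounds and that an empty string means unbounded. Because the start bound is inclusive, each following page asks for one extra column beginning at the last name seen and then drops that name.

```python
from bisect import bisect_left, bisect_right


def get_slice(sorted_names, start, finish, count):
    """Hypothetical stand-in for a column slice query.

    Returns at most count names from sorted_names that fall in
    [start, finish], both inclusive. An empty string for either
    bound means that side is unbounded.
    """
    lo = bisect_left(sorted_names, start) if start else 0
    hi = bisect_right(sorted_names, finish) if finish else len(sorted_names)
    return sorted_names[lo:hi][:count]


def next_page(sorted_names, last_seen, page_size):
    """Return the page of column names that follows last_seen.

    Pass an empty last_seen to fetch the first page.
    """
    if not last_seen:
        return get_slice(sorted_names, "", "", page_size)
    # Ask for one extra column because the start bound is inclusive.
    page = get_slice(sorted_names, last_seen, "", page_size + 1)
    # Drop the column already shown, if it still exists.
    if page and page[0] == last_seen:
        page = page[1:]
    return page[:page_size]


names = ["a", "b", "c", "d", "e"]
first = next_page(names, "", 2)          # ["a", "b"]
second = next_page(names, first[-1], 2)  # ["c", "d"]
```

The caller only has to remember the last column name of the current page, which is why next/previous works but jumping to an arbitrary page does not.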

-Brandon