incubator-cassandra-user mailing list archives

From Sonny Heer <>
Subject Re: Cassandra paging, gathering stats
Date Mon, 22 Feb 2010 21:14:29 GMT

I could use the df command to find the size per column family.  However,
when we were inserting directly into Cassandra (not using StorageService),
we collected the following information for each column family:

Total number of keys: 59557
Total number of columns (over all keys): 2171309
Total size of column data (over all keys): 16557 KB
Total size of key data: 417 KB

This method was extremely slow (as expected, because it also has to do
reads), and once our data set got too large we had to switch
to bulk loading.  Is there any way to get the same information?  It
isn't a huge deal, since as you mentioned I could simply grab the
entire size of the CF.  Thanks.

By duplication I mean that when I store a keyspace/cf/key/column/value I
want to record stats only for unique combinations (there is a lot of
duplication among the records, which I don't care about since Cassandra
only stores the last insert).
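
To make the "unique combinations" idea concrete, here is a minimal
client-side sketch. It assumes the records fit in memory, and the record
shape and the name unique_stats are hypothetical, not part of any
Cassandra API. It also assumes "column data" means value bytes only.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record shape: (keyspace, column_family, key, column, value)
Record = Tuple[str, str, str, str, bytes]


def unique_stats(records: Iterable[Record]) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Aggregate per-column-family stats, counting each column only once."""
    # Last write wins: a later record for the same coordinates replaces
    # the earlier value, so duplicates do not inflate the totals.
    latest: Dict[Tuple[str, str, str, str], bytes] = {}
    for keyspace, cf, key, column, value in records:
        latest[(keyspace, cf, key, column)] = value

    stats: Dict[Tuple[str, str], Dict[str, int]] = {}
    seen_keys = set()
    for (keyspace, cf, key, column), value in latest.items():
        s = stats.setdefault(
            (keyspace, cf),
            {"keys": 0, "columns": 0, "column_bytes": 0, "key_bytes": 0},
        )
        s["columns"] += 1
        s["column_bytes"] += len(value)  # assumption: value bytes only
        if (keyspace, cf, key) not in seen_keys:
            seen_keys.add((keyspace, cf, key))
            s["keys"] += 1
            s["key_bytes"] += len(key.encode("utf-8"))
    return stats
```

For a data set in the terabyte range the dictionary would not fit in
memory; keeping only a hash and a size per combination, or computing the
stats in a batch job, would be the obvious variations.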

The issue with start/finish is that the client will have to send over the
last key displayed for a given page in order to get the next page.  I
was hoping we could swap out an existing solution for Cassandra, where
the client typically passes in start (int), offset (int), page size
(int).  It does not appear there is a way to pull this off.  We are
using the ordered partitioner.
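
To illustrate the difference between the two paging styles, here is a
small sketch. A sorted Python list stands in for a range query over
ordered keys; fetch_page and fetch_by_offset are hypothetical names, not
Cassandra calls. The offset variant only works by walking pages from the
beginning, which is why it gets expensive for large offsets.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def fetch_page(sorted_keys: List[str], start_after: Optional[str],
               page_size: int) -> Tuple[List[str], Optional[str]]:
    """Return one page of keys plus the cursor (last key) for the next page."""
    # Stand-in for a range query against an order-preserving store.
    begin = 0 if start_after is None else bisect_right(sorted_keys, start_after)
    page = sorted_keys[begin:begin + page_size]
    more = begin + page_size < len(sorted_keys)
    next_cursor = page[-1] if page and more else None
    return page, next_cursor


def fetch_by_offset(sorted_keys: List[str], offset: int,
                    page_size: int) -> List[str]:
    """Emulate (offset, page size) by skipping forward from the first key."""
    cursor: Optional[str] = None
    remaining = offset
    while remaining > 0:
        skipped, cursor = fetch_page(sorted_keys, cursor,
                                     min(remaining, page_size))
        remaining -= len(skipped)
        if cursor is None:
            return []  # offset is at or past the end of the key range
    page, _ = fetch_page(sorted_keys, cursor, page_size)
    return page
```

One possible compromise is to keep the existing (offset, page size)
client interface but cache the last key of each page on the server side,
so that only the first visit to a deep page pays for the walk.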

On Mon, Feb 22, 2010 at 12:07 PM, Jonathan Ellis <> wrote:
> On Mon, Feb 22, 2010 at 1:40 PM, Sonny Heer <> wrote:
> > Hey,
> >
> > We are in the process of implementing a cassandra application service.
> >
> > we have already ingested TB of data using the cassandra bulk loader
> > (StorageService).
> >
> > One of the requirements is to get a data explosion factor as a result of
> > denormalization.  Since the writes are going to the memory tables, I'm not
> > sure how I could grab stats.  I can't get size of data before ingest since
> > some of the data may be duplicated.
> Easiest way: write some known amount of data, then use nodeprobe flush
> to force it to disk.  df can tell you how much data is used, no need
> to get fancy.
> 2nd easiest: hack your client to record how much data it is sending over.
> > I was wondering if you knew of any way to do paging over all the keys for a
> > given Column family.  Or perhaps how I can read from the mem table.  I tried
> > the following ...  I'm getting 0 bytes each time.
> You're using SSTableReader locally?
> There won't be any sstables until either a memtable fills up and
> flushes on its own, or you use nodeprobe flush as described above.
> -Jonathan
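
The measurement suggested in the quoted reply (write a known amount of
data, flush, then compare disk usage) comes down to a simple ratio. A
rough sketch, assuming the data directory path is known and that the
sizes are sampled before the write and after the flush; the function
names are hypothetical:

```python
import os


def directory_size(path: str) -> int:
    """Sum the sizes in bytes of all files under path, like a basic du."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def explosion_factor(raw_bytes: int, size_before: int, size_after: int) -> float:
    """Ratio of on-disk growth to the raw bytes the client sent."""
    if raw_bytes <= 0:
        raise ValueError("raw_bytes must be positive")
    return (size_after - size_before) / raw_bytes
```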
