incubator-cassandra-user mailing list archives

From Sonny Heer <sonnyh...@gmail.com>
Subject Cassandra paging, gathering stats
Date Mon, 22 Feb 2010 19:40:06 GMT
Hey,

We are in the process of implementing a Cassandra application service.

We have already ingested terabytes of data using the Cassandra bulk loader
(StorageService).

One of the requirements is to measure the data explosion factor that results
from denormalization.  Since the writes are going to the memtables, I'm not
sure how I could grab stats.  I can't measure the size of the data before
ingest, since some of the data may be duplicated.

I was wondering if you know of any way to page over all the keys for a
given column family, or perhaps how I can read from the memtable.  I tried
the following:


    if (numberOfDocuments > 0 && (numberOfDocuments % 100) == 0) {
        System.out.println("\nSo far " + numberOfDocuments + " have been indexed in: "
                + (System.currentTimeMillis() - t0) / 1000 + " seconds");

        // Column family names below are placeholders.
        Iterable<ColumnFamilyStore> cfIt = storageService.getValidColumnFamilies(keyspaceStr,
                "CF-One", "CF-Two", "CF-Three", "CF-Four", "CF-Five");

        for (ColumnFamilyStore cfStore : cfIt) {
            // Sum the on-disk size of every SSTable in this column family.
            double bytes = 0;
            for (SSTableReader sstable : cfStore.getSSTables()) {
                bytes += sstable.bytesOnDisk();
            }
            System.out.println(" Total size for column family: "
                    + cfStore.getColumnFamilyName() + " = "
                    + FileUtils.stringifyFileSize(bytes));
        }
    }


So that simply prints the size of each column family after every 100
documents ingested.  I'm getting 0 bytes each time.  Any ideas?

Also, a general problem we are running into is finding an easy way to page
over the data set (not just rows but columns).  It looks like the API
currently lets you specify a count, but not an offset.
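
To illustrate the kind of paging meant here, below is a minimal sketch of how
one could walk a sorted key set with only a start key and a count, and no
offset.  It assumes keys come back in sorted order.  The TreeMap stands in for
the store, and fetchPage and allKeys are hypothetical helpers made up for this
example, not Cassandra API calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch only: a TreeMap stands in for a key-ordered store.
// None of the names here are Cassandra APIs.
class KeyPagingSketch {

    // Hypothetical range read: up to `count` keys in sorted order, starting
    // at `startKey` (inclusive). An empty startKey means "from the beginning".
    static List<String> fetchPage(NavigableMap<String, String> store,
                                  String startKey, int count) {
        List<String> page = new ArrayList<>();
        for (String key : store.tailMap(startKey, true).keySet()) {
            if (page.size() == count) {
                break;
            }
            page.add(key);
        }
        return page;
    }

    // Walk every key using only (startKey, count). Each page after the first
    // starts at the last key already seen, so that repeated key is dropped.
    static List<String> allKeys(NavigableMap<String, String> store, int pageSize) {
        if (pageSize < 2) {
            // With a page size of 1 the boundary key would repeat forever.
            throw new IllegalArgumentException("pageSize must be at least 2");
        }
        List<String> all = new ArrayList<>();
        String start = "";
        boolean first = true;
        while (true) {
            List<String> page = fetchPage(store, start, pageSize);
            int from = first ? 0 : 1; // skip the repeated boundary key
            all.addAll(page.subList(Math.min(from, page.size()), page.size()));
            if (page.size() < pageSize) {
                break; // a short page means the end of the key set
            }
            start = page.get(page.size() - 1);
            first = false;
        }
        return all;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> store = new TreeMap<>();
        for (String key : new String[] {"a", "b", "c", "d", "e"}) {
            store.put(key, "value");
        }
        System.out.println(allKeys(store, 2)); // prints [a, b, c, d, e]
    }
}
```

The same idea would apply to columns within a row if they can be read in
sorted order from a given start column, but that is an assumption on my part.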

Thanks
