Date: Mon, 22 Feb 2010 11:40:06 -0800
Message-ID: <991ae7f81002221140i58295305oceafab72824986d9@mail.gmail.com>
Subject: Cassandra paging, gathering stats
From: Sonny Heer <sonnyheer@gmail.com>
To: cassandra-user@incubator.apache.org

Hey,

We are in the process of implementing a Cassandra application service. We have already ingested terabytes of data using the Cassandra bulk loader (StorageService).

One of the requirements is to report a data explosion factor resulting from denormalization. Since the writes are going to the memtables, I'm not sure how I could gather size statistics. I can't measure the size of the data before ingest, since some of the data may be duplicated.

I was wondering if you knew of any way to do paging over all the keys for a given column family, or perhaps how I can read from the memtable. I tried the following:

    if (numberOfDocuments > 0 && (numberOfDocuments % 100) == 0) {
        System.out.println("\nSo far " + numberOfDocuments + " have been indexed in: "
                + (System.currentTimeMillis() - t0) / 1000 + " seconds");

        Iterable<ColumnFamilyStore> cfIt = storageService.getValidColumnFamilies(
                keyspaceStr, "CF-One", "CF-Two", "CF-Three", "CF-Four", "CF-Five");

        for (ColumnFamilyStore cfStore : cfIt) {
            double bytes = 0;
            for (SSTableReader sstable : cfStore.getSSTables()) {
                bytes += sstable.bytesOnDisk();
            }
            System.out.println("Total size for column family: "
                    + cfStore.getColumnFamilyName() + " = "
                    + FileUtils.stringifyFileSize(bytes));
        }
    }

So that simply prints the size of each column family after every 100 documents ingested. I'm getting 0 bytes each time. Any ideas?

Also, a general problem we are running into is finding an easy way to do paging over the data set (not just rows but columns). It looks like the API now has ways to do a count, but no offset.

Thanks
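My current understanding of why this reads zero: writes sit in an in-memory memtable until it is flushed, and sstable.bytesOnDisk() only counts data that has already been flushed into SSTables, so forcing a flush before measuring should give a nonzero number. Here is a toy model of that buffering behavior (all class and method names here are made up for illustration, not Cassandra's real internals):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: writes accumulate in an in-memory "memtable" and only
// count toward on-disk bytes after a flush creates an "SSTable".
public class FlushModel {
    private final Map<String, byte[]> memtable = new HashMap<>();
    private final List<Long> sstableSizes = new ArrayList<>();

    void write(String key, byte[] value) {
        memtable.put(key, value);               // buffered in memory only
    }

    void flush() {                              // memtable -> new "SSTable"
        long size = memtable.values().stream().mapToLong(v -> v.length).sum();
        if (size > 0) sstableSizes.add(size);
        memtable.clear();
    }

    long bytesOnDisk() {                        // what the snippet above measures
        return sstableSizes.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        FlushModel cf = new FlushModel();
        cf.write("row1", new byte[100]);
        System.out.println(cf.bytesOnDisk());   // 0 -- still in the memtable
        cf.flush();
        System.out.println(cf.bytesOnDisk());   // 100 -- now in an "SSTable"
    }
}
```

So if the measurement runs before any memtable has been flushed (via JMX or the command-line node tool), every SSTable sum would come back as 0.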
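For reference, the style of paging I've seen suggested in place of a numeric offset is start-key paging: ask for pageSize keys starting at the last key seen, since range queries return keys in sorted order. A self-contained sketch of that pattern (a TreeSet stands in for the cluster here; this is the pattern, not Cassandra's actual API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Start-key paging: instead of an offset, each page begins just after
// the last key returned by the previous page.
public class KeyPager {
    // Return up to pageSize keys strictly after 'afterKey' ("" for the first page).
    static List<String> nextPage(SortedSet<String> keys, String afterKey, int pageSize) {
        List<String> page = new ArrayList<>();
        for (String k : keys.tailSet(afterKey)) {
            if (k.equals(afterKey)) continue;   // start key is exclusive after page one
            page.add(k);
            if (page.size() == pageSize) break;
        }
        return page;
    }

    public static void main(String[] args) {
        SortedSet<String> keys = new TreeSet<>(
                List.of("doc01", "doc02", "doc03", "doc04", "doc05"));
        String last = "";
        List<String> page;
        while (!(page = nextPage(keys, last, 2)).isEmpty()) {
            System.out.println(page);
            last = page.get(page.size() - 1);   // last key becomes the next start
        }
    }
}
```

The same trick works within a row for columns: request a column slice starting at the last column name seen, rather than asking for an offset.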