Date: Mon, 22 Feb 2010 11:40:06 -0800
Message-ID: <991ae7f81002221140i58295305oceafab72824986d9@mail.gmail.com>
Subject: Cassandra paging, gathering stats
From: Sonny Heer <sonnyheer@gmail.com>
To: cassandra-user@incubator.apache.org

Hey,

We are in the process of implementing a Cassandra application service. We have already ingested terabytes of data using the Cassandra bulk loader (StorageService).

One of the requirements is to report a data explosion factor resulting from denormalization. Since the writes are going to the memtables, I'm not sure how I could gather size statistics. I can't measure the size of the data before ingest, since some of the data may be duplicated.

I was wondering if you knew of any way to do paging over all the keys for a given column family, or perhaps how I can read from the memtable. I tried the following:

    if (numberOfDocuments > 0 && (numberOfDocuments % 100) == 0) {
        System.out.println("\nSo far " + numberOfDocuments + " have been indexed in: "
                + (System.currentTimeMillis() - t0) / 1000 + " seconds");

        Iterable<ColumnFamilyStore> cfIt = storageService.getValidColumnFamilies(
                keyspaceStr, "CF-One", "CF-Two", "CF-Three", "CF-Four", "CF-Five");

        for (ColumnFamilyStore cfStore : cfIt) {
            double bytes = 0;
            for (SSTableReader sstable : cfStore.getSSTables()) {
                bytes += sstable.bytesOnDisk();
            }
            System.out.println("Total size for column family: "
                    + cfStore.getColumnFamilyName() + " = "
                    + FileUtils.stringifyFileSize(bytes));
        }
    }

So that simply prints the size of each column family after every 100 documents ingested. I'm getting 0 bytes each time. Any ideas?

Also, a general problem we are running into is finding an easy way to do paging over the data set (not just rows but columns). It looks like the API now has ways to do a count, but no offset.

Thanks
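My current understanding of why this reads zero: writes sit in an in-memory memtable until it is flushed, and sstable.bytesOnDisk() only counts data that has already been flushed into SSTables, so forcing a flush before measuring should give a nonzero number. Here is a toy model of that buffering behavior (all class and method names here are made up for illustration, not Cassandra's real internals):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: writes accumulate in an in-memory "memtable" and only
// count toward on-disk bytes after a flush creates an "SSTable".
public class FlushModel {
    private final Map<String, byte[]> memtable = new HashMap<>();
    private final List<Long> sstableSizes = new ArrayList<>();

    void write(String key, byte[] value) {
        memtable.put(key, value);               // buffered in memory only
    }

    void flush() {                              // memtable -> new "SSTable"
        long size = memtable.values().stream().mapToLong(v -> v.length).sum();
        if (size > 0) sstableSizes.add(size);
        memtable.clear();
    }

    long bytesOnDisk() {                        // what the snippet above measures
        return sstableSizes.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        FlushModel cf = new FlushModel();
        cf.write("row1", new byte[100]);
        System.out.println(cf.bytesOnDisk());   // 0 -- still in the memtable
        cf.flush();
        System.out.println(cf.bytesOnDisk());   // 100 -- now in an "SSTable"
    }
}
```

So if the measurement runs before any memtable has been flushed (via JMX or the command-line node tool), every SSTable sum would come back as 0.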
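For reference, the style of paging I've seen suggested in place of a numeric offset is start-key paging: ask for pageSize keys starting at the last key seen, since range queries return keys in sorted order. A self-contained sketch of that pattern (a TreeSet stands in for the cluster here; this is the pattern, not Cassandra's actual API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Start-key paging: instead of an offset, each page begins just after
// the last key returned by the previous page.
public class KeyPager {
    // Return up to pageSize keys strictly after 'afterKey' ("" for the first page).
    static List<String> nextPage(SortedSet<String> keys, String afterKey, int pageSize) {
        List<String> page = new ArrayList<>();
        for (String k : keys.tailSet(afterKey)) {
            if (k.equals(afterKey)) continue;   // start key is exclusive after page one
            page.add(k);
            if (page.size() == pageSize) break;
        }
        return page;
    }

    public static void main(String[] args) {
        SortedSet<String> keys = new TreeSet<>(
                List.of("doc01", "doc02", "doc03", "doc04", "doc05"));
        String last = "";
        List<String> page;
        while (!(page = nextPage(keys, last, 2)).isEmpty()) {
            System.out.println(page);
            last = page.get(page.size() - 1);   // last key becomes the next start
        }
    }
}
```

The same trick works within a row for columns: request a column slice starting at the last column name seen, rather than asking for an offset.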