As a first step, I'd like to reproduce the test from http://spyced.blogspot.com/2010/01/cassandra-05.html on my current setup.

Can you post the storage-conf.xml that was used so that I can match the settings as much as possible?

Thanks,

   -- Oren

On Jul 1, 2010, at 3:15 AM, Oren Benjamin wrote:

Thanks Jonathan,

It's great that you still manage to help out individual users.  I first came across your blog while looking for a good reusable bloom filter implementation a while back.  Having surveyed every other Java implementation I could find, I ended up extracting the implementation from Cassandra along with the unit tests as you suggested in the post.  I added a few tests of my own and have been using it in projects ever since.  Saved me the trouble of reimplementing and testing - drinks are on me if we ever run into each other.
// End of digression

Yes, I did increase the heap size; however, the pauses were occurring during normal operation (no streaming, compaction, or flushing) and the heap was nowhere near full.  After discovering https://issues.apache.org/jira/browse/CASSANDRA-1214 , I changed the disk access mode to standard IO, and things appear to have stabilized somewhat (albeit at a steep performance cost).

I haven't seen any examples of Cassandra configurations for Rackspace Cloud, so I'll post what I've got running now and the results I've seen so far.

Overview:

Six 8GB Rackspace Cloud servers (configured identically, except that two of the nodes act as Cassandra seeds)
Applications [mem allocated to JVM]: Cassandra [5GB], Tomcat [500MB], Zabbix agent (for monitoring)
storage-conf.xml

Setup:

The Tomcat instance hosts a servlet which communicates only with the Cassandra node on localhost (via Hector for monitoring and connection pooling).  The web service provided by the servlet is accessed through HAProxy (although I've done testing both with and without the LB).

Testing:

In my current test setup, I have the key cache on (600,000 keys / node) but row cache and mmap disabled.  The DB is preloaded with 200,000,000 fabricated test keys ("key0", "key1", "key2", ...).  Each key has 3 columns with a small amount of data (between 4 and 64 bytes per column).
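
For illustration only, a dataset of this shape could be fabricated along the following lines. This is a minimal sketch, not the loader actually used: the column names (`col0` to `col2`), the random payloads, and the `make_row` helper are all assumptions.

```python
import random

def make_row(i, rng):
    """Return (key, columns) for the i-th fabricated test row."""
    key = "key%d" % i
    # Three columns, each holding between 4 and 64 bytes of random data.
    # The column names are hypothetical.
    columns = {
        "col%d" % c: bytes(rng.getrandbits(8) for _ in range(rng.randint(4, 64)))
        for c in range(3)
    }
    return key, columns

rng = random.Random(42)  # fixed seed so the fabricated data is repeatable
key, columns = make_row(0, rng)
print(key, {name: len(value) for name, value in columns.items()})
```

Each row would then be written through whatever client the loader uses (Hector, in this setup).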

Right now I'm testing reads only.  I have 4 servers (also Rackspace Cloud) running multi-threaded query agents to generate concurrent query load (unfortunately, I only just discovered stress.py in contrib - I'll post test results from stress.py as soon as I can).  Each request is for a single key.  The cache hit rate is 50%.
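
As a rough idea of what such a query agent looks like, here is a minimal sketch. It is not the agent used in these tests: `fetch` is a hypothetical stand-in for one request to the web service, and the thread count and key range are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(key):
    """Hypothetical stand-in for one single-key read against the service."""
    time.sleep(0.001)  # simulate a round trip
    return key

def timed_fetch(key):
    """Run one read and return how long it took, in seconds."""
    start = time.perf_counter()
    fetch(key)
    return time.perf_counter() - start

def run_load(keys, threads):
    """Issue one read per key from a thread pool.

    Returns (reads per second, mean latency in seconds).
    """
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        latencies = list(pool.map(timed_fetch, keys))
    elapsed = time.perf_counter() - started
    return len(latencies) / elapsed, sum(latencies) / len(latencies)

keys = ["key%d" % i for i in range(200)]
rate, mean_latency = run_load(keys, threads=8)
print("%.0f reads/sec, %.2f ms mean latency" % (rate, mean_latency * 1000))
```

Measuring latency per request on the client side, as above, includes queueing and network time, so it will generally read higher than the latency Cassandra reports internally.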

Before switching to standard IO, aggregate throughput across the cluster would briefly spike to as much as 1000 reads/sec before quickly dropping off, presumably because all available RAM had been used up.  After switching to standard IO, throughput stays relatively stable at 210 reads/sec.  The average read latency across the cluster is about 40 milliseconds.
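
A back-of-the-envelope check on these figures, sketched under two assumptions that are not stated above: that load is spread evenly across the 6 nodes, and that the 40 ms latency is measured at the same point where the 210 reads/sec are counted. Little's law (mean requests in flight = throughput x mean latency) then gives the effective concurrency.

```python
nodes = 6
reads_per_sec = 210.0     # aggregate, after switching to standard IO
mean_latency_s = 0.040    # about 40 ms

# Even spread across nodes is an assumption.
per_node_reads_per_sec = reads_per_sec / nodes

# Little's law: mean requests in flight = throughput * mean latency.
in_flight = reads_per_sec * mean_latency_s

print("per node: %.0f reads/sec" % per_node_reads_per_sec)
print("in flight: about %.1f requests" % in_flight)
```

If the query agents together run many more threads than this, that might indicate the 40 ms figure is measured inside Cassandra rather than end to end.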

I realize the dataset is rather large - perhaps more nodes with less RAM each would perform better?  On deck is a test with twelve 4GB nodes for comparison.

Again, thanks for any pointers that would help in optimizing and validating the installation.  If I can get to a state of performance in the cloud that's in line with expectations from other installations, I'd gladly post the setup instructions and results to help fill out this page: http://wiki.apache.org/cassandra/CloudConfig (Rackspace is conspicuously missing).

  -- Oren


On Jun 30, 2010, at 1:58 AM, Jonathan Ellis wrote:

You could be seeing GC pauses. Did you increase the heap size you gave
Cassandra when you increased your VM size?

On Tue, Jun 29, 2010 at 11:57 AM, Oren Benjamin <oren@clearspring.com> wrote:
Hi all - first timer here.

I'm experimenting with Cassandra on Rackspace Cloud.  I started with 4GB nodes and saw read latency spikes while streaming was taking place, so I increased to 8GB to see if limited memory was the issue.  Now I'm seeing very strange behavior during any period in which writes are taking place.  The entire (6 node) cluster seems to pause for periods of as much as 5-8 seconds.  By that I mean all the stats (CPU, disk, network IO monitored via dstat) drop to zero or near zero on all nodes simultaneously.  Does anyone have experience with Cassandra on Rackspace or any idea what's going on here?

The pauses are short enough that it's difficult to introspect the application and determine what it's doing during the pause, but long enough to cause unacceptable latency for any service built on top of it.

Any ideas or debugging methods would be greatly appreciated,

 -- Oren



--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com