Jonathan, sorry for the lengthy emails! Hope this one's more readable.
So I'm fairly convinced it's not a Cassandra-side configuration problem, at least not one that entails tweaking the object count threshold or the memtable size.
Given the client code at http://pastie.org/492753 (my Hadoop client implementation is pretty much identical):
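(For reference, the gist of that client is below. The Thrift package names and the insert() signature are my approximation of this era's generated API, and the table/column family names are placeholders, so treat it as a sketch rather than the exact code:)

    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.cassandra.service.Cassandra; // generated Thrift client; package name approximate

    public class IngestSketch
    {
        public static void main(String[] args) throws Exception
        {
            TSocket socket = new TSocket("localhost", 9160);
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
            socket.open();

            long start = System.currentTimeMillis();
            for (int i = 1; i <= 100000; i++)
            {
                // one column per row; "Table1"/"Standard1" are placeholders, and this
                // insert() signature approximates the old "table, key, cf:column" API
                client.insert("Table1", "row" + i, "Standard1:col",
                              ("value" + i).getBytes(), System.currentTimeMillis(), 0);

                if (i % 1000 == 0) // crude throughput log, to see where it starts crawling
                {
                    long elapsed = System.currentTimeMillis() - start;
                    System.out.println(i + " rows, ~" + (i * 1000L / Math.max(elapsed, 1)) + " rows/sec");
                }
            }
            socket.close();
        }
    }

The per-1000-row throughput line is what makes the ~15k-row slowdown visible.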
I ran this ingestion client until it started crawling. Without stopping the first instance, I started a separate one to see if the crawling behavior would be mimicked there, which I assume would happen if my Cassandra instance were caught up in GC. This one ran fine at first, but it too started crawling once it got through ~15k row inserts. I then started a third ingestion instance, which likewise ran fine until it got through ~15k row inserts.
I ran this against a 3-node Cassandra cluster. The jconsole output for all 3 Cassandra nodes over this entire scenario is attached to this email as a PNG (note: I stopped all ingestion at ~19:55-19:58).
In my storage-conf.xml:
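(Roughly the following; the element names follow this era's default config, and the values shown are the shipped defaults, not necessarily mine:)

    <!-- flush a memtable once it reaches this size... -->
    <MemtableSizeInMB>64</MemtableSizeInMB>
    <!-- ...or once it holds this many objects (in millions) -->
    <MemtableObjectCountInMillions>0.1</MemtableObjectCountInMillions>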
My Cassandra table setup looks like the following:
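(Again roughly; "Table1" and "Standard1" stand in for my real names, and the attributes follow this era's default storage-conf.xml:)

    <Table Name="Table1">
        <!-- ColumnSort is "Name" or "Time" in this config format -->
        <ColumnFamily ColumnSort="Name" Name="Standard1"/>
    </Table>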
The Cassandra JVMs are all running with -Xmx1500m or higher, and each dedicated server has 2G+ of RAM.
From: Jonathan Ellis
Sent: Wednesday, May 27, 2009 5:43:55 PM
Subject: Re: Ingesting from Hadoop to Cassandra
On Wed, May 27, 2009 at 6:39 PM, Alexandre Linares wrote:
> So it actually doesn't look blocked, but it's crawling. Of course, in
> Hadoop, it always timed out (10 mins), before I could tell that it was
> crawling (I think)
So, back to the original hypothesis: you need to increase the memory
you are giving to the JVM (in bin/cassandra.in.sh), or increase the
flush frequency (by lowering the memtable object count threshold).
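(Concretely, something like the following; the variable and element names here are approximate:)

    # bin/cassandra.in.sh -- raise the heap ceiling, e.g.
    JVM_OPTS="$JVM_OPTS -Xms256m -Xmx2G"

    <!-- or, in storage-conf.xml, flush more often by lowering e.g. -->
    <MemtableObjectCountInMillions>0.02</MemtableObjectCountInMillions>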
> Can you reproduce with a non-hadoop client program that you can share here?
BTW, I meant share the client code, not a client thread dump. And
please use attachments for thread dumps or source files; it's really
impossible to read this thread on my phone with everything jammed into
the body. :)