I reduced the load and the problem is happening less often. After enabling GC logging, I see "promotion failed" messages when the pauses happen, so the long pauses appear to coincide with promotion failures. From reading on the web, it looks like I could try lowering the CMSInitiatingOccupancyFraction value and/or decreasing the young gen size to avoid this scenario.
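
For reference, a minimal sketch of what those two changes might look like in cassandra-env.sh. The values 65 and 400M are illustrative assumptions, not recommendations. As far as I know the stock file sets the occupancy fraction to 75 and expects MAX_HEAP_SIZE and HEAP_NEWSIZE to be set as a pair, so check your own copy before changing anything.

```shell
# Sketch of cassandra-env.sh overrides (all values are assumptions; tune per node).

# Fixed heap with a smaller young generation. cassandra-env.sh expects
# both variables to be set together.
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="400M"

# Start CMS cycles earlier so the old gen has more free space
# when ParNew needs to promote objects into it.
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=65"
# Make the JVM honor the fraction instead of its own heuristics.
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
```

In practice it is probably cleaner to edit the existing CMSInitiatingOccupancyFraction line in the file than to append a second one.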

Also, is it normal to see the "Heap is xx full.  You may need to reduce memtable and/or cache sizes" message quite often? I haven't turned on row caches or changed any default memtable size settings, so I am wondering why the old gen fills up.


On Wed, Jul 4, 2012 at 6:28 AM, aaron morton <aaron@thelastpickle.com> wrote:
What accounts for the much larger virtual number? some kind of off-heap memory? 
http://wiki.apache.org/cassandra/FAQ#mmap

I'm a little puzzled as to why I would get such long pauses without swapping. 
The two are not related. On startup the JVM memory is locked so it will not swap; from then on, memory management is pretty much up to the JVM.

Getting a lot of ParNew activity does not mean the JVM is low on memory; it means there is a lot of activity in the new heap.

If you have a lot of insert activity (typically in a load test) you can generate a lot of GC activity. Try reducing the load to a point where it does not hit GC and then increase it to find the cause. Also, if you can connect JConsole to the JVM you may get a better view of the heap usage.

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 3/07/2012, at 3:41 PM, feedly team wrote:

Couple more details. I confirmed that swap space is not being used (free -m shows 0 swap) and cassandra.log has a message like "JNA mlockall successful". top shows the process having 9g in resident memory but 21.6g in virtual. What accounts for the much larger virtual number? Some kind of off-heap memory?
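
A small sketch of one way to read the same numbers straight from the kernel, assuming Linux with /proc mounted. The helper name mem_summary is made up for illustration, and the pgrep pattern in the comment assumes the Cassandra main class name contains "CassandraDaemon".

```shell
# Print virtual size, resident size, and swapped-out size for one process.
# Linux only: reads /proc/<pid>/status. With no argument it inspects the
# current shell, which is handy for a quick sanity check.
mem_summary() {
    pid="${1:-$$}"
    grep -E '^(VmSize|VmRSS|VmSwap):' "/proc/$pid/status"
}

# Usage (the pattern is an assumption about the process command line):
#   mem_summary "$(pgrep -f CassandraDaemon | head -n 1)"
```

A large VmSize next to a much smaller VmRSS and a VmSwap of 0 kB would match what top and free -m report above.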

I'm a little puzzled as to why I would get such long pauses without swapping. I uncommented all the GC logging options in cassandra-env.sh to try to see what is going on when the node freezes.
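
For anyone following along, the GC logging block in cassandra-env.sh looks roughly like the sketch below once uncommented. The flags are standard HotSpot options, but the list is written from memory and the log path is an assumption, so compare it against your own copy of the file.

```shell
# GC logging flags, roughly as they appear in cassandra-env.sh once
# uncommented (from memory; verify against your version of the file).
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
# Log path is an assumption; use any location the Cassandra user can write to.
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
```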

Thanks
Kireet

On Mon, Jul 2, 2012 at 9:51 PM, feedly team <feedlydev@gmail.com> wrote:
Yeah I noticed the leap second problem and ran the suggested fix, but I have been facing these problems before Saturday and still see the occasional failures after running the fix. 

Thanks.


On Mon, Jul 2, 2012 at 11:17 AM, Marcus Both <mboth@terra.com.br> wrote:
Yeah! Look at that.
http://arstechnica.com/business/2012/07/one-day-later-the-leap-second-v-the-internet-scorecard/
I had the same problem. The solution was rebooting.

On Mon, 2 Jul 2012 11:08:57 -0400
feedly team <feedlydev@gmail.com> wrote:

> Hello,
>    I recently set up a 2 node cassandra cluster on dedicated hardware. In
> the logs there have been a lot of "InetAddress xxx is now dead" or UP
> messages. Comparing the log messages between the 2 nodes, they seem to
> coincide with extremely long ParNew collections. I have seen some of up to
> 50 seconds. The installation is pretty vanilla, I didn't change any
> settings and the machines don't seem particularly busy - cassandra is the
> only thing running on the machine with an 8GB heap. The machine has 64GB of
> RAM and CPU/IO usage looks pretty light. I do see a lot of 'Heap is xxx
> full. You may need to reduce memtable and/or cache sizes' messages. Would
> this help with the long ParNew collections? That message seems to be
> triggered on a full collection.

--
Marcus Both