Responses below. Thanks!

On Fri, Jul 6, 2012 at 3:09 PM, aaron morton <aaron@thelastpickle.com> wrote:
It looks like this happens when there is a promotion failure. 

Java Heap is full. 
Memory is fragmented. 
Use C for web scale. 
Unfortunately I became too dumb to use C around 2004. Camping accident.

Also, is it normal to see the "Heap is xx full. You may need to reduce memtable and/or cache sizes" message quite often? I haven't turned on row caches or changed any default memtable size settings, so I am wondering why the old gen fills up.

It's odd to get that out of the box with an 8GB heap on a 1.1.X install. 

What sort of workload? Is it under heavy inserts?
OpsCenter shows 60-120 writes/sec and 80-150 reads/sec total for both machines. I am not sure if that is considered heavy or not. The machines don't seem particularly busy, and load seems pretty even across both.

Do you have a lot of CFs? A lot of secondary indexes?
I have 15 column families, with maybe 4 that are larger and active. There are a couple of secondary indexes. OpsCenter uses 8 CFs and system uses 7. Total data is ~100GB.

After the messages, is it able to reduce heap usage?
Seems like it; they occur every few minutes for a while and then stop.

Does it seem to correlate with compactions?
No.
 
Is the node able to get back to a healthy state?
Yes. After the GC finishes, it rejoins the cluster.
 
If this is testing, are you able to pull back to a workload where the issue does not appear?

I am guessing so. I am running a data-heavy background processing job. Since I reduced its thread count from 20 to 15, the problem has happened only once in the past 2 days, versus 2-3 times a day before. We are just starting to use Cassandra, so I am more worried about when more critical web traffic hits.



Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 7/07/2012, at 4:33 AM, feedly team wrote:

I reduced the load and the problem hasn't been happening as much. After enabling GC logging, I see messages mentioning "promotion failed" when the pauses happen. It looks like this happens when there is a promotion failure. From reading on the web, it looks like I could try reducing the CMSInitiatingOccupancyFraction value and/or decreasing the young gen size to try to avoid this scenario.
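In case it helps anyone else, the knobs I'm looking at in cassandra-env.sh are along these lines (the values here are only illustrative guesses, not tested recommendations):

    # cassandra-env.sh -- illustrative values only
    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="400M"    # smaller young gen = less to promote per ParNew

    # start CMS earlier so the old gen has headroom when promotion happens
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=65"
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"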

Also, is it normal to see the "Heap is xx full. You may need to reduce memtable and/or cache sizes" message quite often? I haven't turned on row caches or changed any default memtable size settings, so I am wondering why the old gen fills up.


On Wed, Jul 4, 2012 at 6:28 AM, aaron morton <aaron@thelastpickle.com> wrote:
What accounts for the much larger virtual number? Some kind of off-heap memory?
http://wiki.apache.org/cassandra/FAQ#mmap

I'm a little puzzled as to why I would get such long pauses without swapping. 
The two are not related. On startup the JVM memory is locked so it will not swap; from then on, memory management is pretty much up to the JVM.

Getting a lot of ParNew activity does not mean the JVM is low on memory; it means there is a lot of activity in the young generation.

If you have a lot of insert activity (typically in a load test) you can generate a lot of GC activity. Try reducing the load to a point where it does not hit GC problems, and then increase it to find the cause. Also, if you can connect JConsole to the JVM you may get a better view of the heap usage.
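For example, assuming the default JMX port (7199, set via JMX_PORT in cassandra-env.sh), something like:

    # connect JConsole to a node's JMX port; the hostname is a placeholder
    jconsole your-node-hostname:7199

The Memory tab there shows the heap and the individual memory pools over time, which makes it easier to see whether the old gen is really filling up.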

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 3/07/2012, at 3:41 PM, feedly team wrote:

A couple more details: I confirmed that swap space is not being used (free -m shows 0 swap), and cassandra.log has a message like "JNA mlockall successful". top shows the process with 9g resident memory but 21.6g virtual... What accounts for the much larger virtual number? Some kind of off-heap memory?

I'm a little puzzled as to why I would get such long pauses without swapping. I uncommented all the GC logging options in cassandra-env.sh to try to see what is going on when the node freezes.
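For reference, the options in question look roughly like this (the exact set varies by version):

    # GC logging section of cassandra-env.sh (approximate)
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
    JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
    JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"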

Thanks
Kireet

On Mon, Jul 2, 2012 at 9:51 PM, feedly team <feedlydev@gmail.com> wrote:
Yeah, I noticed the leap second problem and ran the suggested fix, but I had been seeing these problems before Saturday and I still see occasional failures after running the fix.
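(For anyone finding this thread later: the workaround that was circulating for the leap-second bug was roughly the following; I can't vouch that it is the exact fix referred to above.)

    # commonly suggested leap-second workaround -- reference only
    /etc/init.d/ntp stop
    date -s "$(date)"      # set the clock to itself to clear the bad state
    /etc/init.d/ntp start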

Thanks.


On Mon, Jul 2, 2012 at 11:17 AM, Marcus Both <mboth@terra.com.br> wrote:
Yeah! Look at that.
http://arstechnica.com/business/2012/07/one-day-later-the-leap-second-v-the-internet-scorecard/
I had the same problem. The solution was rebooting.

On Mon, 2 Jul 2012 11:08:57 -0400
feedly team <feedlydev@gmail.com> wrote:

> Hello,
>    I recently set up a 2-node Cassandra cluster on dedicated hardware.
> In the logs there have been a lot of "InetAddress xxx is now dead" or
> UP messages. Comparing the log messages between the 2 nodes, they seem
> to coincide with extremely long ParNew collections; I have seen some of
> up to 50 seconds. The installation is pretty vanilla: I didn't change
> any settings, and the machines don't seem particularly busy. Cassandra
> is the only thing running on the machine, with an 8GB heap. The machine
> has 64GB of RAM and CPU/IO usage looks pretty light. I do see a lot of
> 'Heap is xxx full. You may need to reduce memtable and/or cache sizes'
> messages. Would reducing those sizes help with the long ParNew
> collections? That message seems to be triggered on a full collection.

--
Marcus Both