Alain,

Can you post your mdadm --detail /dev/md0 output here, as well as your iostat -x -d output from when that happens? A bad ephemeral drive on EC2 is not unheard of.
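
For reference, something along these lines should capture both (md0 is from your description; the trailing 5 3 is optional and just takes three extended-stat samples at 5-second intervals instead of a single since-boot average):

    mdadm --detail /dev/md0
    iostat -x -d 5 3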

Alexis | @alq | http://datadog.com

P.S. Also, disk utilization is not a reliable metric; iostat's await and svctm are more useful, imho.
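
If you want to watch those live on just the RAID members, a rough sketch (device names taken from your earlier mail, adjust as needed):

    iostat -x -d xvdb xvdc xvdd xvde 5

then compare await and svctm across the four disks; one member sitting well above its siblings usually points at that drive rather than at the workload.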


On Sun, Mar 31, 2013 at 6:03 AM, aaron morton <aaron@thelastpickle.com> wrote:
It's not on my radar.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton

On 28/03/2013, at 2:43 PM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:

Ok, if you're going to look into it, please keep me/us posted.

It happened twice for me, the same day, within a few hours, on the same node, and only on 1 node out of 12, making that node almost unreachable.


2013/3/28 aaron morton <aaron@thelastpickle.com>
I noticed this on an m1.xlarge (Cassandra 1.1.10) instance today as well: 1 or 2 disks in a RAID 0 running at 85 to 100%, the others at around 35 to 50%.

Have not looked into it. 

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton

On 26/03/2013, at 11:57 PM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:

We use C* on m1.xlarge AWS EC2 servers, with 4 disks (xvdb, xvdc, xvdd, xvde) forming a software RAID 0 array (md0).
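
For reference, the array is assembled the usual mdadm way, roughly as follows (from memory, so the exact flags may differ):

    mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde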

I usually see their usage increase in exactly the same way. This morning there was a normal minor compaction, followed by dropped messages on one node (out of 12).

Looking closely at this node, I saw the following:


On this node, one of the four disks (xvdd) started working hard while the others worked much less intensively.

This is quite weird since I have always seen these 4 disks being used in exactly the same way at every moment (as you can see on 5 other nodes, or when the ".239" node comes back to normal).

Any idea what happened and how it can be avoided?

Alain