I didn't launch these commands when I was having trouble; next time I will. For now, here is what I have (everything is working properly at the moment).

$mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sun Mar 17 01:46:05 2013
     Raid Level : raid0
     Array Size : 1761459200 (1679.86 GiB 1803.73 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Sun Mar 17 01:46:05 2013
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 256K

           Name : ip-xxx-xxx-xxx-239:0  (local to host ip-xxx-xxx-xxx-239)
           UUID : 2cbc3efe:11f8f35d:b4f55c81:3903c715
         Events : 0

    Number   Major   Minor   RaidDevice State
       0     202       17        0      active sync   /dev/xvdb1
       1     202       33        1      active sync   /dev/xvdc1
       2     202       49        2      active sync   /dev/xvdd1
       3     202       65        3      active sync   /dev/xvde1


$iostat -x -d
Linux 3.2.0-35-virtual (ip-xxx-xxx-xxx-239)       04/02/2013      _x86_64_  (4 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdap1            0.00     0.30    0.29    0.68     4.41     6.03    21.52     0.01    6.18   18.40    1.03   2.73   0.26
xvdb              0.00     0.05   59.36    3.57  1601.94   144.99    55.51     0.85   13.50   12.43   31.16   2.56  16.12
xvdc              0.00     0.01   59.33    3.48  1601.75   144.81    55.62     0.81   12.92   11.83   31.62   2.51  15.77
xvdd              0.00     0.05   59.31    3.53  1601.69   144.83    55.58     1.25   19.96   18.99   36.28   3.00  18.85
xvde              0.00     0.01   59.30    3.45  1601.62   144.46    55.65     1.04   16.50   15.34   36.37   2.85  17.87
md0               0.00     0.00  237.36   14.14  6406.99   579.10    55.56     0.00    0.00    0.00    0.00   0.00   0.00
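
When the problem comes back I will also grab a few interval samples instead of the since-boot averages above, so the numbers reflect the load at that moment; something like this (the interval and count are just what I plan to try):

$iostat -x -d xvdb xvdc xvdd xvde md0 5 6
# 6 reports, 5 seconds apart, limited to the RAID members and md0;
# the first report is the since-boot average, the following ones are per-interval.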

@Rudolf

Thanks for the insight, I might use this solution next time as well.

Alain

2013/3/31 Rudolf van der Leeden <rudolf.vanderleeden@scoreloop.com>
I've seen the same behaviour (SLOW ephemeral disk) a few times.
You can't do anything with a single slow disk except not use it.
Our solution has always been to replace the m1.xlarge instance ASAP, and then everything is good again.
-Rudolf.

On 31.03.2013, at 18:58, Alexis Lê-Quôc wrote:

Alain,

Can you post your mdadm --detail /dev/md0 output here, as well as your iostat -x -d output, when that happens? A bad ephemeral drive on EC2 is not unheard of.

Alexis | @alq | http://datadog.com

P.S. Also, disk utilization is not a reliable metric; iostat's await and svctm are more useful, IMHO.
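
E.g. something rough like this pulls just those two columns out of a live sample (field positions assume the extended output layout with r_await/w_await; adjust if yours differs):

# fields 10 and 13 are await and svctm in that layout
$iostat -x -d 5 3 | awk '/^xvd|^md/ {print $1, "await=" $10, "svctm=" $13}'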


On Sun, Mar 31, 2013 at 6:03 AM, aaron morton <aaron@thelastpickle.com> wrote:
Ok, if you're going to look into it, please keep me/us posted.
It's not on my radar.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton

On 28/03/2013, at 2:43 PM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:

Ok, if you're going to look into it, please keep me/us posted.

It happened twice for me on the same day, within a few hours, on the same node, and only on 1 node out of 12, making that node almost unreachable.


2013/3/28 aaron morton <aaron@thelastpickle.com>
I noticed this on an m1.xlarge (Cassandra 1.1.10) instance today as well: 1 or 2 disks in a RAID 0 running at 85 to 100%, the others at 35 to 50%.

Have not looked into it. 

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton

On 26/03/2013, at 11:57 PM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:

We use C* on m1.xlarge AWS EC2 servers, with 4 disks (xvdb, xvdc, xvdd, xvde) that are part of a logical RAID 0 (md0).
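
For context, md0 is a plain mdadm RAID 0 over the four ephemeral disks. It was built with something along these lines (options from memory, so take them as indicative):

$mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=256 /dev/xvdb1 /dev/xvdc1 /dev/xvdd1 /dev/xvde1
# level 0, 4 members, 256K chunk, one partition per ephemeral disk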

I usually see their utilization increase in the same way on all four. This morning there was a normal minor compaction, followed by dropped messages on one node (out of 12).

Looking closely at this node I saw the following:


On this node, one of the four disks (xvdd) started working much harder while the others worked less intensively.

This is quite weird since I have always seen these 4 disks being used in exactly the same way at every moment (as you can see on 5 other nodes, or when the ".239" node comes back to normal).

Any idea what happened and how it can be avoided?

Alain