From: Alexis Lé-Quôc
Date: Sun, 31 Mar 2013 12:58:34 -0400
Subject: Re: weird behavior with RAID 0 on EC2
To: user@cassandra.apache.org

Alain,

Can you post your mdadm --detail /dev/md0 output here, as well as your iostat -x -d output from when that happens? A bad ephemeral drive on EC2 is not unheard of.

Alexis | @alq | http://datadog.com

P.S. Also, disk utilization is not a reliable metric; iostat's await and svctm are more useful, IMHO.
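Something along these lines is what I have in mind (a rough sketch; the 5-second interval and sample count are just examples, and I'm assuming the array really is /dev/md0):

    # Array geometry and member state
    sudo mdadm --detail /dev/md0
    cat /proc/mdstat

    # Extended per-device stats, 5-second samples, 12 reports;
    # the await and svctm columns are the interesting ones
    iostat -x -d 5 12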
On Sun, Mar 31, 2013 at 6:03 AM, aaron morton <aaron@thelastpickle.com> wrote:
> Ok, if you're going to look into it, please keep me/us posted.
>
> It's not on my radar.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 28/03/2013, at 2:43 PM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:
>
> Ok, if you're going to look into it, please keep me/us posted.
>
> It happened to me twice the same day, within a few hours, on the same node,
> and only on 1 node out of 12, making that node almost unreachable.
>
> 2013/3/28 aaron morton <aaron@thelastpickle.com>
>
>> I noticed this on an m1.xlarge (Cassandra 1.1.10) instance today as well:
>> 1 or 2 disks in a RAID 0 running at 85 to 100% while the others were at 35 to 50%.
>>
>> Have not looked into it.
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Consultant
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 26/03/2013, at 11:57 PM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:
>>
>> We use C* on m1.xlarge AWS EC2 servers, with 4 disks (xvdb, xvdc, xvdd, xvde)
>> that are part of a logical RAID 0 (md0).
>>
>> I usually see their usage increase in the same way. This morning there was
>> a normal minor compaction followed by dropped messages on one node (out of 12).
>>
>> Looking closely at this node I saw the following:
>>
>> http://img69.imageshack.us/img69/9425/opscenterweirddisk.png
>>
>> On this node, one of the four disks (xvdd) started working much harder while
>> the others worked less intensively.
>>
>> This is quite weird since I have always seen these 4 disks being used in exactly
>> the same way at every moment (as you can see on 5 other nodes, or when node
>> ".239" comes back to normal).
>>
>> Any idea what happened and how it can be avoided?
>>
>> Alain
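Regarding the four-member md0 Alain describes above: a rough way to see whether one member (xvdd in this case) is lagging behind the others, and to sanity-check it for a bad ephemeral drive, might look like this (a sketch only; the device names are taken from the thread and the dd read size is arbitrary):

    # Extended stats for just the four RAID 0 members; on a healthy array
    # r/s, w/s, await and %util should stay roughly even across them
    iostat -x xvdb xvdc xvdd xvde 5 12

    # Non-destructive read test of the suspect member, then a look for kernel I/O errors
    sudo dd if=/dev/xvdd of=/dev/null bs=1M count=4096 iflag=direct
    dmesg | grep -i xvdd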