incubator-cassandra-user mailing list archives

From Michael Theroux <mthero...@yahoo.com>
Subject Re: Really odd issue (AWS related?)
Date Sun, 28 Apr 2013 16:37:24 GMT
Hello,

We've done some additional monitoring, and I think we have more information.  We've been collecting
vmstat information every minute, attempting to catch a node with issues.

So, it appears that the Cassandra node runs fine.  Then suddenly, without any correlation
to any event that I can identify, the I/O wait time goes way up and stays up indefinitely.
Even non-Cassandra I/O activities (such as snapshots and backups) start causing large I/O
wait times when they typically would not.  Before an issue appears, we would typically see I/O
wait times of 3-4% with very few processes blocked on I/O.  Once this issue manifests itself,
I/O wait times for the same activities jump to 30-40% with many blocked processes.  The I/O
wait times do go back down when there is literally no activity.
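
A minimal sketch of how the per-minute vmstat samples described above could be scanned for the jump in I/O wait. The function name flag_iowait and the 20% default cutoff are assumptions for illustration, not part of the original setup; the "b" (blocked processes) and "wa" (I/O wait) columns are located by name from the vmstat header row.

```sh
# Hypothetical helper: flag vmstat samples whose I/O wait exceeds a threshold.
# Reads vmstat output on stdin and locates columns by name from the header row.
flag_iowait() {
    limit="${1:-20}"   # percent; 20 is an assumed cutoff, not from the post
    awk -v limit="$limit" '
        # Header row ("r b swpd ..."): remember where "b" and "wa" are.
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) {
                if ($i == "b")  b = i
                if ($i == "wa") wa = i
            }
            next
        }
        # Data rows start with a number; report those above the limit.
        wa && $1 ~ /^[0-9]+$/ && $wa + 0 > limit + 0 {
            printf "HIGH wa=%s blocked=%s\n", $wa, $b
        }
    '
}
```

It could be fed live samples with something like "vmstat 60 | flag_iowait 20", or run over a saved log of earlier samples.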

-  Updating the node to the latest Amazon Linux patches and rebooting the instance doesn't
correct the issue.
-  Backing up the node, and replacing the instance does correct the issue.  I/O wait times
return to normal.

One relatively recent change we've made is that we upgraded to m1.xlarge instances, which have 4
ephemeral drives available.  We create a logical volume from the 4 drives with the idea that
we should be able to get increased I/O throughput.  When we ran m1.large instances, we had
the same setup, although it was only using 2 ephemeral drives.  We chose to use LVM vs. mdadm
because we were having issues having mdadm create the RAID volume reliably on restart (and
research showed that this was a common problem).  LVM just worked (and had worked for months
before this upgrade).
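
Because the logical volume spans all 4 ephemeral drives, one way to narrow this down might be to check whether a single underlying device is much busier than its peers. The sketch below is an assumption-laden illustration: busiest_device is a hypothetical helper that reads "iostat -x" output on stdin and relies only on the "%util" column, which it finds by name in the "Device" header row.

```sh
# Hypothetical helper: print the device with the highest %util from
# "iostat -x" output on stdin, as "<device> <util>".
busiest_device() {
    awk '
        # A blank line ends a device table; stop treating rows as devices.
        NF == 0 { util = 0; next }
        # Header row starts with "Device" (older sysstat prints "Device:").
        $1 ~ /^Device/ {
            util = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") util = i
            next
        }
        # Device rows: keep the highest %util seen so far.
        util && NF >= util && $util + 0 > max + 0 { max = $util; dev = $1 }
        END { if (dev != "") printf "%s %s\n", dev, max }
    '
}
```

For example, "iostat -x 60 2 | busiest_device" would report the busiest device across the two reports; comparing that value against the other drives is left to the reader.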

For reference, this is the script we used to create the logical volume:

vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
blockdev --setra 65536 /dev/mnt_vg/mnt_lv
sleep 2
mkfs.xfs /dev/mnt_vg/mnt_lv
sleep 3
mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
sleep 3
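
As a worked check of the numbers in the script above: blockdev --setra takes a count of 512-byte sectors, and the full stripe width is the stripe size (-I) times the number of stripes (-i). The helper names below are hypothetical and only do the arithmetic.

```sh
# Hypothetical helpers: convert the script's tuning values into KiB.

# blockdev --setra takes a count of 512-byte sectors.
ra_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

# Full stripe width = stripe size (KiB) * number of stripes.
full_stripe_kib() {
    echo $(( $1 * $2 ))
}

ra_sectors_to_kib 65536   # readahead set by the script, in KiB
full_stripe_kib 256 4     # width of one full stripe, in KiB
```

With the values from the script, the readahead works out to 32768 KiB (32 MiB) and one full stripe to 1024 KiB (1 MiB).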

Another tidbit... thus far (and this may be only a coincidence), we've only had to replace
DB nodes within a single availability zone within us-east.  Other availability zones in the
same region have yet to show an issue.

It looks like I'm going to need to replace a third DB node today.  Any advice would be appreciated.

Thanks,
-Mike


On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:

> Thanks.
> 
> We weren't monitoring this value when the issue occurred, and this particular issue has
> not appeared for a couple of days (knock on wood).  Will keep an eye out though,
> 
> -Mike
> 
> On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:
> 
>> top command? st : time stolen from this vm by the hypervisor
>> 
>> jason
>> 
>> 
>> On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux <mtheroux2@yahoo.com> wrote:
>> Sorry, Not sure what CPU steal is :)
>> 
>> I have AWS console with detailed monitoring enabled... things seem to track close
>> to the minute, so I can see the CPU load go to 0... then jump at about the minute Cassandra
>> reports the dropped messages,
>> 
>> -Mike
>> 
>> On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
>> 
>>>> The messages appear right after the node "wakes up".
>>> Are you tracking CPU steal ? 
>>> 
>>> -----------------
>>> Aaron Morton
>>> Freelance Cassandra Consultant
>>> New Zealand
>>> 
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>> 
>>> On 25/04/2013, at 4:15 AM, Robert Coli <rcoli@eventbrite.com> wrote:
>>> 
>>>> On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux <mtheroux2@yahoo.com> wrote:
>>>>> Another related question.  Once we see messages being dropped on one
>>>>> node, our cassandra client appears to see this, reporting errors.  We use LOCAL_QUORUM with
>>>>> a RF of 3 on all queries.  Any idea why clients would see an error?  If only one node reports
>>>>> an error, shouldn't the consistency level prevent the client from seeing an issue?
>>>> 
>>>> If the client is talking to a broken/degraded coordinator node, RF/CL
>>>> are unable to protect it from RPCTimeout. If it is unable to
>>>> coordinate the request in a timely fashion, your clients will get
>>>> errors.
>>>> 
>>>> =Rob
>>> 
>> 
>> 
> 

