incubator-cassandra-user mailing list archives

From Philippe Dupont <pdup...@teads.tv>
Subject Re: Raid Issue on EC2 Datastax ami, 1.2.11
Date Thu, 05 Dec 2013 15:42:20 GMT
Hi again,

I have more information on this case:

We investigated the affected nodes further and found await problems on one
of the 4 disks in the RAID:
http://imageshack.com/a/img824/2391/s7q3.jpg

Here is the iostat output of the node:
http://imageshack.us/a/img7/7282/qq3w.png

You can see that the write and read throughput are exactly the same on the
4 disks of the instance, so the RAID 0 striping itself looks fine. Yet the
global await, r_await and w_await are 3 to 5 times higher on the xvde disk
than on the other disks.
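
For anyone who wants to run the same check without reading screenshots, here
is a rough Python sketch of the comparison. It is not what we actually ran. It
assumes `iostat -x` output from a sysstat version that still prints an "await"
column (newer versions only have r_await and w_await, so pass a different
column name there). The function names, the 3x threshold and the sample
numbers are all made up for illustration.

```python
import statistics


def parse_iostat_await(text, column="await"):
    """Return {device: value} from `iostat -x` text, finding `column` by header name."""
    awaits = {}
    col = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            col = None  # a blank line ends the current device table
            continue
        if fields[0].startswith("Device"):
            col = fields.index(column)  # raises ValueError if the column is absent
            continue
        if col is not None and len(fields) > col:
            try:
                awaits[fields[0]] = float(fields[col])
            except ValueError:
                pass  # not a numeric device row, skip it
    return awaits


def find_slow_disks(awaits, ratio=3.0):
    """Return devices whose value is at least `ratio` times the median of the others."""
    slow = []
    for dev, value in awaits.items():
        others = [v for d, v in awaits.items() if d != dev]
        if not others:
            continue
        baseline = statistics.median(others)
        if baseline > 0 and value >= ratio * baseline:
            slow.append(dev)
    return sorted(slow)


# Hypothetical sample, NOT the real numbers from our node.
SAMPLE = """\
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdb 0.00 0.00 50.00 20.00 800.00 400.00 34.29 0.30 4.00 3.50 5.00 1.00 7.00
xvdc 0.00 0.00 50.00 20.00 800.00 400.00 34.29 0.32 4.50 4.00 5.50 1.00 7.00
xvdd 0.00 0.00 50.00 20.00 800.00 400.00 34.29 0.35 5.00 4.50 6.00 1.00 7.00
xvde 0.00 0.00 50.00 20.00 800.00 400.00 34.29 1.26 18.00 16.00 22.00 1.00 7.00
"""

if __name__ == "__main__":
    print(find_slow_disks(parse_iostat_await(SAMPLE)))
```

Comparing each disk against the median of the other RAID members, rather than
against a fixed number of milliseconds, keeps the check meaningful whatever
the overall load on the node is.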

We reported this to Amazon support, and here is their answer:
" Hello,
I deeply apologize for any inconvenience this has been causing you and
thank you for the additional information and screenshots. Using the
instance you based your "iostat" on ("i-xxxxxxxx"), I have looked into the
underlying hardware it is currently using and I can see it appears to have
a noisy neighbor leading to the higher "await" time on that particular
device. Since most AWS services are multi-tenant, situations can arise
where one customer's resource has the potential to impact the performance
of a different customer's resource that reside on the same underlying
hardware (a "noisy neighbor"). While these occurrences are rare, they are
nonetheless inconvenient and I am very sorry for any impact it has created.
I have also looked into the initial instance referred to when the case was
created ("i-xxxxxxx") and cannot see any existing issues (neighboring or
otherwise) as to any I/O performance impacts; however, at the time the case
was created, evidence on our end suggests there was a noisy neighbor then
as well. Can you verify if you are still experiencing above average "await"
times on this instance? If you would like to mitigate the impact of
encountering "noisy neighbors", you can look into our Dedicated Instance
option; Dedicated Instances launch on hardware dedicated to only a single
customer (though this can feasibly lead to a situation where a customer is
their own noisy neighbor). However, this is an option available only to
instances that are being launched into a VPC and may require modification
of the architecture of your use-case. I understand the instances belonging
to your cluster in question have been launched into EC2-Classic, I just
wanted to bring this your attention as a possible solution. You can read
more about Dedicated Instances here:
http://aws.amazon.com/dedicated-instances/ Again, I am very sorry for the
performance impact you have been experiencing due to having noisy
neighbors. We understand the frustration and are always actively working to
increase capacity so the effects of noisy neighbors is lessened. I hope
this information has been useful and if you have any additional questions
whatsoever, please do not hesitate to ask! "

To conclude, the only solution that avoids moving to a VPC with Dedicated
Instances is to replace this instance with a new one and hope it does not
land next to other "noisy neighbors"...
I hope this helps someone.

Philippe


2013/11/28 Philippe DUPONT <pdupont@teads.tv>

> Hi,
>
> We have a Cassandra cluster of 28 nodes. Each one is an EC2 m1.xlarge
> based on the DataStax AMI with 4 storage volumes in RAID 0 mode.
>
> Here is the ticket we opened with amazon support :
>
> "This raid is created using the datastax public AMI : ami-b2212dc6.
> Sources are also available here : https://github.com/riptano/ComboAMI
>
> As you can see in the attached screenshot (
> http://imageshack.com/a/img854/4592/xbqc.jpg), randomly but frequently
> one of the storage volumes gets fully used (100%) while the 3 others stay
> at low use.
>
> Because of this, the node becomes slow and the whole Cassandra cluster is
> impacted. We are losing data due to write failures, and availability for
> our customers.
>
> It was in this state for one hour, and we decided to restart it.
>
> We already removed 3 other instances because of this same issue."
> (see other screenshots)
> http://imageshack.com/a/img824/2391/s7q3.jpg
> http://imageshack.com/a/img10/556/zzk8.jpg
>
> Amazon support took a close look at the instance as well as its
> underlying hardware for any potential health issues and both seem to be
> healthy.
>
> Has anyone already experienced something like this?
>
> Should I contact the AMI author instead?
>
> Thanks a lot,
>
> Philippe.
>
>
>
>
