cloudstack-dev mailing list archives

From Jeromy Grimmett <jer...@cloudbrix.com>
Subject RE: [Proposal] - StorageHA
Date Tue, 14 Mar 2017 04:31:31 GMT
I apologize for the delay in responding; let me clarify the points raised:

Mike asked:

"What I was curious about is if you plan to exclusively build your feature as a set of scripts
and/or if you plan to update the CloudStack code base, as well."

JG:  My idea was to build this separately as a plugin, then add it to the code base down the
road.

"Also, if a primary storage actually goes offline, I'm not clear on how starting an impacted
VM on a different compute host would help. Could you clarify this for me?"

JG:  The VM would be started on another host that still has access to the storage.  An individual
host can have problems and lose its connectivity to a primary storage device while other hosts
can still reach it.  The solution we are working on would get the VM back up and running much
faster than waiting for CloudStack to decide to restart it on a different host.
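To make "another host that still has access to the storage" concrete, here is a minimal sketch of the host-selection step. It assumes the master script keeps, per host, the set of pools that passed that host's latest read/write test and a running-VM count; `HostStatus`, `pick_target_host`, and the least-loaded tie-break are hypothetical illustrations, not part of the proposal or of CloudStack.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HostStatus:
    # Hypothetical view of one KVM host as seen by the master script.
    name: str
    reachable_pools: set = field(default_factory=set)  # pools whose last test passed
    running_vms: int = 0


def pick_target_host(pool: str, failed_host: str, hosts: List[HostStatus]) -> Optional[str]:
    """Return a host other than failed_host that can still reach pool.

    Assumption: prefer the host with the fewest running VMs (ties broken by name).
    Returns None when no other host can reach the pool.
    """
    candidates = [h for h in hosts if h.name != failed_host and pool in h.reachable_pools]
    if not candidates:
        return None
    return min(candidates, key=lambda h: (h.running_vms, h.name)).name
```

Returning `None` rather than guessing lets the caller fall back to alerting only, which matters when the pool itself (not one host's path to it) is down.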

Paul asked:

"  1.  We can't/don't run scripts on vSphere hosts (not sure about Hyper-V)"

JG:  I should have been clearer: this is for KVM hosts.
  
"2.  I know of one failure scenario (which happened) where MTU issues in intermediate switches
meant that small amounts of data could pass, but anything that was passed as jumbo frames
then failed. So it would be important to exercise that."

JG:  I have faced this jumbo frame issue as well.  Perhaps we need an option indicating that
jumbo frames are used to access that storage, so that the test fails if the storage cannot be
reached using jumbo frames.
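One way such a jumbo-frame option could work on a Linux KVM host is to send full-size ICMP packets with fragmentation prohibited, so an undersized MTU anywhere on the path makes the probe fail instead of silently fragmenting. This is a sketch under stated assumptions: it relies on the iputils `ping` flags `-M do` (prohibit fragmentation), `-s` (ICMP payload size), and `-c` (count), and assumes IPv4 with no IP options (20-byte IP header plus 8-byte ICMP header). The function names and storage address are hypothetical, and an ICMP probe only approximates the real NFS/iSCSI data path.

```python
import subprocess
from typing import List

IPV4_HEADER_BYTES = 20  # assumes no IP options
ICMP_HEADER_BYTES = 8


def jumbo_ping_command(target: str, mtu: int = 9000, count: int = 3) -> List[str]:
    """Build an iputils ping that sends full-MTU packets with the DF bit set."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES  # e.g. 9000 -> 8972 bytes
    return ["ping", "-M", "do", "-s", str(payload), "-c", str(count), target]


def path_carries_mtu(target: str, mtu: int = 9000, count: int = 3, timeout: float = 10.0) -> bool:
    """True if at least one full-size, unfragmented probe got a reply.

    iputils ping exits 0 when any reply was received, non-zero otherwise.
    """
    try:
        result = subprocess.run(jumbo_ping_command(target, mtu, count),
                                capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

A host script could run `path_carries_mtu(storage_ip)` alongside the read/write test and report a distinct "jumbo frames failing" result, which covers the case where small packets pass but large ones do not.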

"3.  You need to be very sure of failures before shutting hosts down.  Also a host is likely
to be connected to multiple storage pools, so you wouldn't want to shut down a host due to
one pool becoming unavailable."

JG:  The script wouldn't shut down any hosts at all.  It would just force-stop the affected VMs
on that specific host and then start them on a host that is not having the storage issue.
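Because a host usually mounts several pools, "the affected VMs" should mean only the VMs on that host whose volumes live on the failed pool. A minimal sketch of that filter, assuming the master script knows which pools back each VM's volumes; `Vm` and `vms_to_relocate` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List


@dataclass
class Vm:
    name: str
    host: str
    pools: set = field(default_factory=set)  # pools holding this VM's volumes (assumed known)


def vms_to_relocate(vms: List[Vm], host: str, failed_pool: str) -> List[str]:
    """Names of VMs on `host` with at least one volume on `failed_pool`.

    VMs on the same host that only use healthy pools are left alone.
    """
    return [vm.name for vm in vms if vm.host == host and failed_pool in vm.pools]
```

For example, if `kvm1` runs `web1` on `pool-a` and `db1` on `pool-b`, a `pool-a` failure on `kvm1` moves only `web1`.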

"4.  Environments can have hundreds of storage pools, so watch out for spamming the logs with
updates."

JG:  The polling/testing intervals are configurable, so I am hoping that will help.  Each result
is small, so the log volume should be relatively negligible.
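Configurable intervals reduce volume, but with hundreds of pools a complementary approach is to log only when a host/pool pair changes state. This is a minimal sketch of that idea, not something the proposal describes; `TransitionLogger` is a hypothetical helper built on the standard `logging` module.

```python
import logging
from typing import Dict, Tuple


class TransitionLogger:
    """Log a (host, pool) result only when it differs from the previous result."""

    def __init__(self, logger: logging.Logger) -> None:
        self._logger = logger
        self._last: Dict[Tuple[str, str], bool] = {}

    def record(self, host: str, pool: str, ok: bool) -> bool:
        """Store the result; return True if it was logged (i.e. the state changed)."""
        key = (host, pool)
        previous = self._last.get(key)
        self._last[key] = ok
        if previous == ok:
            return False  # unchanged: keep the history elsewhere, don't spam the log
        level = logging.INFO if ok else logging.WARNING
        self._logger.log(level, "%s -> %s: %s", host, pool, "OK" if ok else "FAILED")
        return True
```

The full per-test history could still go to a database table, while the log carries only OK/FAILED transitions.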

"5.  The primary storage pools have a 'state' which should get updated and used by the deployment
planners"

JG:  I have copied Alex on this email to make sure he sees this suggestion.  We will figure
out how to incorporate that 'state' field.

"6.  Secondary storage pools don't have a 'state' - but it would be great if that were added
in the DB and reflected in the UI."

JG:  For now, I think this is a feature request that we should submit through the normal
CloudStack request process.  Otherwise, we can definitely include it in our work when we
start adding this into the code base.

To take this a step further, we are also working on a KVM host load balancer whose results will
be used as a factor when choosing where to move the VMs.  We have a number of small projects
under way.

Thank you all for reviewing the information.  All suggestions are welcome.

Jeromy Grimmett
P: 603.766.3625
jeromy@cloudbrix.com
www.cloudbrix.com


-----Original Message-----
From: Paul Angus [mailto:paul.angus@shapeblue.com] 
Sent: Saturday, March 11, 2017 2:43 AM
To: dev@cloudstack.apache.org
Subject: RE: [Proposal] - StorageHA

Hi Jeromy,

I love the idea.  I'm not really a developer, so those guys will look at things a different
way, but...

These would be my initial comments:


  1.  We can't/don't run scripts on vSphere hosts (not sure about Hyper-V)
  2.  I know of one failure scenario (which happened) where MTU issues in intermediate switches
meant that small amounts of data could pass, but anything that was passed as jumbo frames
then failed. So it would be important to exercise that.
  3.  You need to be very sure of failures before shutting hosts down.  Also a host is likely
to be connected to multiple storage pools, so you wouldn't want to shut down a host due to
one pool becoming unavailable.
  4.  Environments can have hundreds of storage pools, so watch out for spamming the logs
with updates.
  5.  The primary storage pools have a 'state' which should get updated and used by the deployment
planners
  6.  Secondary storage pools don't have a 'state' - but it would be great if that were added
in the DB and reflected in the UI.



Kind regards,

Paul Angus


paul.angus@shapeblue.com
www.shapeblue.com
53 Chandos Place, Covent Garden, London WC2N 4HS UK @shapeblue
  
 

From: Jeromy Grimmett [mailto:jeromy@cloudbrix.com]
Sent: 10 March 2017 15:28
To: dev@cloudstack.apache.org
Subject: [Proposal] - StorageHA

Hello,

I am new to the mailing list, and we are glad to be a part of the CloudStack community.  We
are looking to develop plugins and modules that will help grow the adoption and use of
CloudStack.  So as part of my introductory email, I'd like to introduce a small project we
have been working on: a StorageHA Monitor.  The Monitor would allow CloudStack and the hosts
to test for, communicate, and resolve VM availability issues when storage (primary and/or
secondary) becomes unavailable.  This is a short write-up of how it would work:

It consists of two scripts/programs:

The host script runs on the host servers and checks whether the primary and secondary storage
are available by doing a read/write test, then reports to the master script that runs on the
CloudStack management server.  The host script tests a read and a write to the storage every
5 seconds (configurable), and if the test fails 3 times (configurable), the failure is recorded
by the master script.
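The host-side check described above can be sketched as follows. Assumptions, all ours rather than the proposal's: each pool is a mounted directory, the probe writes, fsyncs, reads back, and deletes a 4 KiB file, "fails 3 times" means 3 consecutive failures, and `report` is a hypothetical callback that forwards the failure to the master script. A production check would also need its own timeout, because I/O on a hung NFS mount can block indefinitely.

```python
import os
import time
import uuid
from typing import Callable, Optional


def read_write_test(mount_point: str) -> bool:
    """Write a probe file to the pool's mount, read it back, then delete it."""
    path = os.path.join(mount_point, f".storageha-probe-{uuid.uuid4().hex}")
    payload = os.urandom(4096)
    try:
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # make sure the write actually reached storage
        with open(path, "rb") as f:
            ok = f.read() == payload
        os.remove(path)
        return ok
    except OSError:
        return False


class FailureTracker:
    """Count consecutive failures; signal once when the threshold is reached."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive = 0

    def update(self, ok: bool) -> bool:
        self.consecutive = 0 if ok else self.consecutive + 1
        return self.consecutive == self.threshold


def monitor(mount_point: str, report: Callable[[str], None], interval: float = 5.0,
            threshold: int = 3, max_checks: Optional[int] = None) -> None:
    """Run the test every `interval` seconds; call `report` on the Nth consecutive failure."""
    tracker = FailureTracker(threshold)
    checks = 0
    while max_checks is None or checks < max_checks:
        if tracker.update(read_write_test(mount_point)):
            report(mount_point)
        checks += 1
        time.sleep(interval)
```

Reporting only at the moment the threshold is crossed, rather than on every failed check, keeps a long outage from producing a stream of duplicate reports.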

The master script monitors the results of the host script.  If the test passes, nothing happens
except that the result is logged, so that we can track the history of the test results.
If the test reports a failure, the master script performs the following actions:


  *   Secondary Storage - It will simply generate and send an alert that the failure has occurred.


  *   Primary Storage - The script will perform the following tasks:
     *   Generate and send an alert that the failure has occurred.
     *   Force the VMs on that host to shut down.
     *   Determine which host to move the VMs to.
     *   Start the VMs on the healthy host.
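The decision logic above can be sketched as a function that turns one failure report into an ordered list of actions. Everything here is hypothetical: `StorageKind`, `handle_failure`, the action tuples, and the `choose_host` callback are illustrations, and executing the actions (for example through the CloudStack API) is left out. One deliberate deviation from the listed order: this sketch determines the target host before force-stopping anything, so that VMs are not stopped when no healthy host can reach the pool.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple

# (verb, subject, host): hypothetical action records for the master script to execute.
Action = Tuple[str, str, Optional[str]]


class StorageKind(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


def handle_failure(kind: StorageKind, pool: str, failed_host: str,
                   affected_vms: List[str],
                   choose_host: Callable[[str, str], Optional[str]]) -> List[Action]:
    """Plan the master script's response to a reported storage failure."""
    actions: List[Action] = [("alert", f"{kind.value} storage {pool} failed on {failed_host}", None)]
    if kind is StorageKind.SECONDARY:
        return actions  # secondary storage: alert only
    target = choose_host(pool, failed_host)
    if target is None:
        # Assumption: with nowhere to go, leave the VMs alone and raise a second alert.
        actions.append(("alert", f"no healthy host can reach {pool}; VMs left on {failed_host}", None))
        return actions
    actions += [("force_stop", vm, failed_host) for vm in affected_vms]
    actions += [("start", vm, target) for vm in affected_vms]
    return actions
```

Returning a plan instead of acting directly makes the policy easy to test and to log before any VM is touched.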

We have already started working on some code, and the solution seems to be testing well.
Any thoughts, ideas, or input are welcome.  If there is already a solution out there, please
forgive our ignorance and point us in the right direction.  We look forward to further
collaboration with you all.

Regards,
j

Jeromy Grimmett
155 Fleet Street
Portsmouth, NH 03801
Direct: 603.766.3625
Office: 603.766.4908
Fax: 603.766.4729
jeromy@cloudbrix.com
www.cloudbrix.com
