cloudstack-users mailing list archives

From Adriano Paterlini <paterl...@usp.br>
Subject Re: Cloudstack + XenServer 6.2 + NetApp in production
Date Sun, 15 Feb 2015 21:28:56 GMT
Yiping,

We do have a production environment with a similar configuration; here are
some parameters and logs you can check.

First of all, a XenServer NFS timeout occurs every time the NFS server takes
more than 13.3 (40.0/3.0) seconds to answer a read or write NFS call; this
is defined as SOFTMOUNT_TIMEOUT in /opt/xensource/sm/nfs.py. There are some
XenServer forum discussions about changing this parameter; my conclusion
is that it is not recommended. The consequence would be virtual machines
going into read-only mode unless the VM parameters are also modified (Linux
defaults are usually 30 seconds). NFS timeouts are logged in
/var/log/kern.log.
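
To make the numbers above concrete, here is a minimal Python sketch that only compares figures quoted in this thread (the 40.0/3.0 second soft-mount timeout, the 30 second Linux guest default, and the failover delays mentioned by Yiping and in the NetApp documentation). It reads nothing from a live host, and the names are hypothetical, chosen for illustration.

```python
# All values are figures quoted in this thread, not read from a live system.
SOFTMOUNT_TIMEOUT_S = 40.0 / 3.0   # XenServer NFS soft-mount timeout, about 13.3 s
GUEST_IO_TIMEOUT_S = 30.0          # usual Linux guest default, per this message

# Failover delays mentioned in the thread (labels are illustrative).
FAILOVER_DELAYS_S = {
    "expected (low)": 10.0,
    "expected (high)": 15.0,
    "documented maximum": 180.0,
}


def exceeds(delay_s, timeout_s):
    """Return True when a storage stall lasts longer than the given timeout."""
    return delay_s > timeout_s


def report(delays):
    """Build one summary line per failover delay."""
    lines = []
    for label, delay in delays.items():
        host = "exceeds" if exceeds(delay, SOFTMOUNT_TIMEOUT_S) else "within"
        guest = "exceeds" if exceeds(delay, GUEST_IO_TIMEOUT_S) else "within"
        lines.append(
            f"{label}: {delay:.1f} s ({host} host soft-mount timeout, "
            f"{guest} guest default)"
        )
    return lines


if __name__ == "__main__":
    for line in report(FAILOVER_DELAYS_S):
        print(line)
```

Even a 15 second failover is longer than the host-side timeout, so NFS timeout messages in kern.log during a takeover are to be expected.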

However, the timeout itself does not cause a host reboot. The reboot is
probably due to the CloudStack HA storage fence, just as Tim mentioned. The
storage fence is enforced by the script /opt/cloud/bin/xenheartbeat.sh; you
can check its log entries to confirm whether that was the case. If it really
is the case, you can adjust the CloudStack global settings
xenserver.heartbeat.interval and xenserver.heartbeat.timeout to accommodate
planned maintenance and even automatic storage-side HA. You should check
with NetApp for recommended values for your environment: takeover/giveback
delays may vary according to controller version and even current controller
load, and the NetApp documentation mentions 180 seconds as the maximum delay.
Also check that the script is running correctly with
# ps -aux | grep heartbeat; it should take 3 parameters. If not, you may be
affected by https://issues.apache.org/jira/browse/CLOUDSTACK-7184.
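
The parameter check above can be sketched in Python as follows. This is only an illustration: it counts the arguments that follow xenheartbeat.sh in a single command line such as one copied from the ps output. The sample command line and its ARG1..ARG3 placeholders are hypothetical; the real argument values are not given in this thread.

```python
import shlex

SCRIPT_NAME = "xenheartbeat.sh"
EXPECTED_ARGS = 3  # per this message, the script should take 3 parameters


def heartbeat_args(command_line):
    """Return the arguments after the heartbeat script, or None if it is absent."""
    tokens = shlex.split(command_line)
    for index, token in enumerate(tokens):
        if token.endswith(SCRIPT_NAME):
            return tokens[index + 1:]
    return None


def has_expected_args(command_line):
    """Return True when the script appears with exactly EXPECTED_ARGS arguments."""
    args = heartbeat_args(command_line)
    return args is not None and len(args) == EXPECTED_ARGS


if __name__ == "__main__":
    # Hypothetical command line; ARG1..ARG3 stand in for the real values.
    sample = "/bin/bash /opt/cloud/bin/xenheartbeat.sh ARG1 ARG2 ARG3"
    print(has_expected_args(sample))
```

A line that shows the script with fewer than three arguments is the symptom to compare against the JIRA issue linked above.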

Hope the comments help your decision.


Regards,
Adriano


On Sun, Feb 15, 2015 at 12:38 PM, <cyrano@usp.br> wrote:

> FYI
>
> Sent from my iPhone
>
> Begin forwarded message:
>
> *From:* Yiping Zhang <yzhang@marketo.com>
> *Date:* February 15, 2015 at 2:00:05 AM GMT-2
> *To:* "users@cloudstack.apache.org" <users@cloudstack.apache.org>
> *Subject:* *Re: Cloudstack + XenServer 6.2 + NetApp in production*
> *Reply-To:* <users@cloudstack.apache.org>
>
> Tim,
>
> Thanks for the reply.
>
> In our case, the NetApp cluster as a whole did not fail.  The NetApp
> cluster failover happened because the Operations team was performing
> scheduled maintenance; this is normal behavior. To the best of my
> knowledge, a NetApp head failover should take anywhere from 10 to 15
> seconds.
>
> As you guessed correctly, our XenServer resource pool does have HA
> enabled, and HA shared SR is indeed on the same NetApp cluster as the
> primary storage SR.  Though I am not sure if enabling xen pool HA is the
> cause of xenserver's rebooting under this particular scenario.
>
> I am not sure if I understand your statement that "In that case, HA would
> detect the storage failure and fence the XenServer host".  Can you
> elaborate a little more on this statement?
>
> Thanks again,
>
> Yiping
>
>
> On 2/14/15, 6:26 AM, "Tim Mackey" <tmackey@gmail.com> wrote:
>
> Yiping,
>
> The specific problem covered by that note was solved a long time ago.
> Timeouts can be caused by a number of things, and if the entire NetApp
> cluster went offline, the XenServer host would be impacted.  Since you are
> experiencing a host reboot when this happens, I suspect you have XenServer
> HA enabled with the heartbeat on the same NetApp cluster.  In that case, HA
> would detect the storage failure and fence the XenServer host.
>
> The solution here would be to understand why your NetApp cluster failed
> during scheduled maintenance. Something in your configuration has created a
> single point of failure. If you've enabled HA, I also would like to
> understand why you've chosen to do that.  Going slightly commercial for a
> second, I would also advise you to look into a commercial support contract
> for your production XenServer hosts. That team is going to be able to go
> deeper, and much quicker, when production issues arise than this list.
> NetApp and XenServer is used in a very large number of deployments, so if
> there is something wrong they'll be more likely to know. For example, there
> could be a set of XenServer or OnTap patches to help sort this out.
>
> -tim
>
> On Fri, Feb 13, 2015 at 7:36 PM, Yiping Zhang <yzhang@marketo.com> wrote:
>
> Hi, all:
>
> I am wondering if any one is running their CloudStack in production
> deployments with  XenServer 6.2 + NetApp clusters ?
>
> Recently, in our non production deployment (rhel 6.6 + CS 4.3.0 +
> XenServer 6.2 cluster + NetApp cluster), all our XenServer rebooted
> automatically because of NFS timeout, when our NetApp cluster failover
> happened during a scheduled filer maintenance. My google search turned up
> this Citrix hot fix: http://support.citrix.com/article/CTX135623 for
> XenServer 6.0.2, and this post about XenServer 6.2:
> http://www.gossamer-threads.com/lists/xen/devel/320020 .
>
> Obviously the problem still exists for XenServer 6.2 and we are very
> concerned about going to production deployment based on this technology
> stack.
>
> If anyone has a similar setup, please share your experiences.
>
> Thanks,
>
> Yiping
>


-- 
Adriano Arantes Paterlini
Analista de Sistemas
Centro de Tecnologia da Informação - CeTI-SP
Superintendência de Tecnologia da Informação - STI
Universidade de São Paulo

Fone: +55 (11) 3091-0494

Av. Professor Luciano Gualberto, 71, tv. 3
Cidade Universitária - São Paulo / SP
