cloudstack-users mailing list archives

From Yiping Zhang <yzh...@marketo.com>
Subject Re: Cloudstack + XenServer 6.2 + NetApp in production
Date Mon, 16 Feb 2015 03:25:48 GMT
Hi, Tim and Adriano:

Thanks very much for the detailed and very insightful replies.

After going back and rereading the log files, I am now convinced that it
was indeed the HA feature that caused the XenServers to fence themselves.
I'll follow that Citrix support article to determine the best HA timeout
value to use in our environment.
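[The KB referenced below (CTX139166) describes adjusting the pool-level HA
timeout. A hypothetical sketch of that change follows; the 60-second value
and the SR UUID are placeholders, not recommendations — derive yours from
the measured NetApp takeover/giveback time:]

```shell
# Hypothetical sketch of raising the XenServer pool HA timeout per CTX139166.
# Both values below are placeholders; the commands are echoed rather than
# executed so the sketch is safe to run anywhere.
HA_TIMEOUT=60                        # seconds; must exceed the longest storage outage
HEARTBEAT_SR="<heartbeat-sr-uuid>"   # left as a placeholder on purpose
echo "xe pool-ha-disable"
echo "xe pool-ha-enable heartbeat-sr-uuids=${HEARTBEAT_SR} ha-config:timeout=${HA_TIMEOUT}"
```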

Yiping


On 2/15/15, 1:50 PM, "Tim Mackey" <tmackey@gmail.com> wrote:

>Here's a KB which covers how to change the *XenServer* HA setting (not the
>CloudStack one): http://support.citrix.com/article/CTX139166.  It would be
>good to check /var/log/xha.log to see if any issues were logged there.
>Also note that with HA you always want to have your hosts NTP-synced.  With
>the default timeout being 30 seconds, I'd start by verifying with your
>NetApp admins how long the head was actually offline.  I'd also look into
>any network config issues (assuming you've bonded your storage network).
>
>-tim
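[Tim's checklist — xha.log entries, NTP sync, storage-network bonding — can
be sketched as a quick triage script. Paths are the XenServer defaults
mentioned in the thread; tool availability is not assumed, so each check
degrades to a message rather than failing:]

```shell
# Triage sketch for the checks suggested above. Runs defensively so it
# prints fallbacks instead of erroring on a non-XenServer box.
echo "== xha.log (XenServer HA daemon) =="
grep -iE 'fenc|timeout|watchdog' /var/log/xha.log 2>/dev/null \
  || echo "no xha.log entries found (file missing or no matches)"

echo "== NTP sync status =="
command -v ntpstat >/dev/null 2>&1 && ntpstat \
  || echo "ntpstat not available; check clock sync by hand"

echo "== storage network bonds =="
cat /proc/net/bonding/* 2>/dev/null \
  || echo "no bonding info found"
STATUS=done
```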
>
>On Sun, Feb 15, 2015 at 4:28 PM, Adriano Paterlini <paterlini@usp.br>
>wrote:
>
>> Yiping,
>>
>> We run a production environment with a similar configuration; here are
>> some parameters and logs you can check.
>>
>> First of all, a XenServer NFS timeout will occur every time the NFS
>> server takes more than 13.3 seconds (40.0/3.0) to answer a read or write
>> NFS call; this is defined as SOFTMOUNT_TIMEOUT in
>> /opt/xensource/sm/nfs.py. There are some XenServer forum discussions
>> about changing this parameter; my conclusion is that it is not
>> recommended, since the consequence would be virtual machines going into
>> read-only mode unless VM parameters are also modified (Linux defaults
>> are usually 30 seconds). NFS timeouts are shown in /var/log/kern.log.
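[The 13.3-second figure above falls out of the softmount arithmetic: a
40-second overall budget divided across 3 retransmissions. A small sketch
of that calculation, using the constants quoted from nfs.py; the kern.log
line in the comment is an illustrative example of the symptom, not a quote
from the poster's logs:]

```shell
# SOFTMOUNT_TIMEOUT arithmetic as quoted in the thread: a 40-second overall
# budget divided across 3 retries gives the per-call deadline after which
# the NFS layer reports a timeout.
TOTAL_BUDGET=40.0
RETRIES=3.0
PER_CALL=$(awk -v t="$TOTAL_BUDGET" -v r="$RETRIES" 'BEGIN { printf "%.1f", t / r }')
echo "per-call NFS deadline: ${PER_CALL}s"   # 40.0 / 3.0 = 13.3
# The matching /var/log/kern.log symptom would look something like:
#   nfs: server <filer> not responding, timed out
```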
>>
>> However, the timeout itself does not cause the host reboot; the reboot
>> is probably due to the CloudStack HA storage fence, just as Tim
>> mentioned. The storage fence is enforced by the script
>> /opt/cloud/bin/xenheartbeat.sh; you can check the log entries to confirm
>> whether that was the case. If it really was, you can adjust the
>> CloudStack global settings xenserver.heartbeat.interval and
>> xenserver.heartbeat.timeout to accommodate planned maintenance and even
>> automatic storage-side HA. You should check with NetApp for recommended
>> values for your environment; takeover/giveback delays may vary with
>> controller version and even current controller load, and NetApp
>> documentation mentions 180 seconds as the maximum delay. Also check that
>> the script is running correctly (# ps aux | grep heartbeat); it should
>> take 3 parameters, and if not you may be affected by
>> https://issues.apache.org/jira/browse/CLOUDSTACK-7184.
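[A sketch of the ps check described above, using a simulated process line
so the argument-counting logic is visible; the `<host-uuid>` etc. are
deliberate placeholders, and on a live host you would feed it the actual
`ps aux` output instead:]

```shell
# Sketch of verifying that xenheartbeat.sh was launched with its 3 expected
# parameters (fewer is the symptom of CLOUDSTACK-7184). The sample line is
# simulated; on a live host, replace it with: ps aux | grep [x]enheartbeat
SAMPLE="/opt/cloud/bin/xenheartbeat.sh <host-uuid> <interval> <timeout>"
set -- $SAMPLE            # split the command line into words
NARGS=$(( $# - 1 ))       # everything after the script path itself
if [ "$NARGS" -eq 3 ]; then
  echo "heartbeat script has 3 parameters: looks OK"
else
  echo "heartbeat script has $NARGS parameters: check CLOUDSTACK-7184"
fi
```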
>>
>> Hope these comments help with your decision.
>>
>>
>> Regards,
>> Adriano
>>
>>
>> On Sun, Feb 15, 2015 at 12:38 PM, <cyrano@usp.br> wrote:
>>
>> > FYI
>> >
>> > Sent from my iPhone
>> >
>> > Begin forwarded message:
>> >
>> > *From:* Yiping Zhang <yzhang@marketo.com>
>> > *Date:* February 15, 2015 at 2:00:05 AM GMT-2
>> > *To:* "users@cloudstack.apache.org" <users@cloudstack.apache.org>
>> > *Subject:* *Re: Cloudstack + XenServer 6.2 + NetApp in production*
>> > *Reply-To:* <users@cloudstack.apache.org>
>> >
>> > Tim,
>> >
>> > Thanks for the reply.
>> >
>> > In our case, the NetApp cluster as a whole did not fail.  The NetApp
>> > cluster failover happened because the Operations team was performing
>> > scheduled maintenance; this is normal behavior. To the best of my
>> > knowledge, a NetApp head failover should take anywhere from 10 to 15
>> > seconds.
>> >
>> > As you guessed correctly, our XenServer resource pool does have HA
>> > enabled, and the HA shared SR is indeed on the same NetApp cluster as
>> > the primary storage SR.  Though I am not sure if enabling xen pool HA
>> > is the cause of the xenservers' rebooting under this particular
>> > scenario.
>> >
>> > I am not sure I understand your statement that "In that case, HA
>> > would detect the storage failure and fence the XenServer host".  Can
>> > you elaborate a little more on this statement?
>> >
>> > Thanks again,
>> >
>> > Yiping
>> >
>> >
>> > On 2/14/15, 6:26 AM, "Tim Mackey" <tmackey@gmail.com> wrote:
>> >
>> > Yiping,
>> >
>> >
>> > The specific problem covered by that note was solved a long time ago.
>> > Timeouts can be caused by a number of things, and if the entire NetApp
>> > cluster went offline, the XenServer host would be impacted.  Since you
>> > are experiencing a host reboot when this happens, I suspect you have
>> > XenServer HA enabled with the heartbeat on the same NetApp cluster.
>> > In that case, HA would detect the storage failure and fence the
>> > XenServer host.
>> >
>> > The solution here would be to understand why your NetApp cluster
>> > failed during scheduled maintenance. Something in your configuration
>> > has created a single point of failure. If you've enabled HA, I would
>> > also like to understand why you've chosen to do that.  Going slightly
>> > commercial for a second, I would also advise you to look into a
>> > commercial support contract for your production XenServer hosts. That
>> > team is going to be able to go deeper, and much quicker, when
>> > production issues arise than this list. NetApp and XenServer is used
>> > in a very large number of deployments, so if there is something wrong
>> > they'll be more likely to know. For example, there could be a set of
>> > XenServer or OnTap patches to help sort this out.
>> >
>> > -tim
>> >
>> >
>> > On Fri, Feb 13, 2015 at 7:36 PM, Yiping Zhang <yzhang@marketo.com>
>> > wrote:
>> >
>> > Hi, all:
>> >
>> > I am wondering if anyone is running their CloudStack in production
>> > with XenServer 6.2 + NetApp clusters?
>> >
>> > Recently, in our non-production deployment (RHEL 6.6 + CS 4.3.0 +
>> > XenServer 6.2 cluster + NetApp cluster), all our XenServers rebooted
>> > automatically because of an NFS timeout when a NetApp cluster failover
>> > happened during scheduled filer maintenance. My Google search turned
>> > up this Citrix hotfix for XenServer 6.0.2:
>> > http://support.citrix.com/article/CTX135623, and this post about
>> > XenServer 6.2: http://www.gossamer-threads.com/lists/xen/devel/320020
>> >
>> > Obviously the problem still exists for XenServer 6.2, and we are very
>> > concerned about going to a production deployment based on this
>> > technology stack.
>> >
>> > If anyone has a similar setup, please share your experiences.
>> >
>> > Thanks,
>> >
>> > Yiping
>> >
>>
>>
>> --
>> Adriano Arantes Paterlini
>> Analista de Sistemas
>> Centro de Tecnologia da Informação - CeTI-SP
>> Superintendência de Tecnologia da Informação - STI
>> Universidade de São Paulo
>>
>> Fone: +55 (11) 3091-0494
>>
>> Av. Professor Luciano Gualberto, 71, tv. 3
>> Cidade Universitária - São Paulo / SP
>>
