Delivered-To: mailing list users@cloudstack.apache.org
From: Yiping Zhang
To: "users@cloudstack.apache.org"
Date: Sat, 14 Feb 2015 22:00:05 -0600
Subject: Re: Cloudstack + XenServer 6.2 + NetApp in production

Tim,

Thanks for the reply.

In our case, the NetApp cluster as a whole did not fail. The failover happened
because the Operations team was performing scheduled maintenance; that is
normal behavior. To the best of my knowledge, a NetApp head failover should
take anywhere from 10 to 15 seconds.

As you guessed, our XenServer resource pool does have HA enabled, and the HA
heartbeat SR is indeed on the same NetApp cluster as the primary storage SR.
I am not sure, though, whether enabling pool HA is the cause of the XenServer
reboots in this particular scenario.

I am also not sure I understand your statement that "In that case, HA would
detect the storage failure and fence the XenServer host." Can you elaborate a
little more on this?

Thanks again,

Yiping

On 2/14/15, 6:26 AM, "Tim Mackey" wrote:

>Yiping,
>
>The specific problem covered by that note was solved a long time ago.
>Timeouts can be caused by a number of things, and if the entire NetApp
>cluster went offline, the XenServer host would be impacted.
>Since you are experiencing a host reboot when this happens, I suspect you
>have XenServer HA enabled with the heartbeat on the same NetApp cluster.
>In that case, HA would detect the storage failure and fence the XenServer
>host.
>
>The solution here would be to understand why your NetApp cluster failed
>during scheduled maintenance. Something in your configuration has created
>a single point of failure. If you've enabled HA, I would also like to
>understand why you've chosen to do that. Going slightly commercial for a
>second, I would also advise you to look into a commercial support contract
>for your production XenServer hosts. That team is going to be able to go
>deeper, and much quicker, when production issues arise than this list.
>NetApp and XenServer are used in a very large number of deployments, so if
>there is something wrong they'll be more likely to know. For example,
>there could be a set of XenServer or OnTap patches to help sort this out.
>
>-tim
>
>On Fri, Feb 13, 2015 at 7:36 PM, Yiping Zhang wrote:
>
>> Hi, all:
>>
>> I am wondering if anyone is running their CloudStack in production
>> deployments with XenServer 6.2 + NetApp clusters?
>>
>> Recently, in our non-production deployment (RHEL 6.6 + CS 4.3.0 +
>> XenServer 6.2 cluster + NetApp cluster), all our XenServer hosts
>> rebooted automatically because of an NFS timeout, when a NetApp cluster
>> failover happened during scheduled filer maintenance. My Google search
>> turned up this Citrix hotfix for XenServer 6.0.2:
>> http://support.citrix.com/article/CTX135623 and this post about
>> XenServer 6.2: http://www.gossamer-threads.com/lists/xen/devel/320020 .
>>
>> Obviously the problem still exists for XenServer 6.2, and we are very
>> concerned about going to production deployment based on this technology
>> stack.
>>
>> If anyone has a similar setup, please share your experiences.
>>
>> Thanks,
>>
>> Yiping
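For readers in a similar setup, here is a rough sketch of how one might
confirm whether the pool HA heartbeat lives on the same filer as primary
storage, using the XenServer xe CLI. This is an illustration, not a verified
procedure: the UUIDs are placeholders, and disabling HA around planned filer
maintenance is only one possible mitigation, not necessarily the root-cause
fix Tim describes above.

```shell
# Sketch: inspect pool HA configuration on a XenServer host (run as root).
# All UUIDs shown are placeholders for your own pool/SR/VDI UUIDs.

# Is HA enabled on the pool?
xe pool-list params=ha-enabled

# Find the HA statefile VDI (the heartbeat), then the SR that contains it.
POOL_UUID=$(xe pool-list --minimal)
VDI_UUID=$(xe pool-param-get uuid="$POOL_UUID" param-name=ha-statefiles)
xe vdi-param-get uuid="$VDI_UUID" param-name=sr-uuid

# Compare that SR UUID against the NFS SRs used for primary storage.
xe sr-list type=nfs params=uuid,name-label

# One possible mitigation (assumption, not advice from the thread): disable
# HA before a planned filer failover so a brief NFS outage does not cause
# the hosts to fence (self-reboot), then re-enable it afterwards.
xe pool-ha-disable
# ... perform storage maintenance ...
xe pool-ha-enable heartbeat-sr-uuids=<heartbeat-sr-uuid>
```

If the statefile SR and the primary storage SR resolve to the same filer, the
heartbeat shares the single point of failure that Tim's reply warns about.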