incubator-cloudstack-dev mailing list archives

From "Musayev, Ilya" <imusa...@webmd.net>
Subject Re: Issues when vCenter becomes unavailable
Date Sat, 23 Feb 2013 23:21:51 GMT
Any chance of some sort of fix for 4.0 or 4.1?

I understand that the CS-669 (feature/enhancement) patch missed the commit deadline and will be
in 4.2, but there is a real issue here that impacts production now.

Also, this is not a feature but a bug. I don't know whether bugs are treated on the same schedule
as features.

Technically, for testing - we don't need to fail hypervisors. vMotion would achieve the same
effect and host ID will get out of sync. It's only a theory though.

I will open a bug request on JIRA and ask for some visibility.

Alternatively, we can probably have a hack that will query VC for hosts and vms, identify
what's changed, and update db - I'm just trying to avoid hacks.
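
To make the idea concrete, here is a minimal Python sketch of only the "identify what's changed" step. It assumes the VM-to-host mappings have already been fetched from VC and from the db by some other means; the function name, the dictionaries and the sample IDs are all hypothetical, and nothing here talks to vCenter or mysql.

```python
def find_host_drift(db_hosts, vc_hosts):
    """Return {vm_name: (db_host_id, vc_host_id)} for VMs whose
    recorded host differs from the host vCenter reports.

    Both arguments are plain dicts of vm_name -> host_id. How they are
    populated (vCenter query, mysql query) is outside this sketch.
    """
    drift = {}
    for vm_name, vc_host_id in vc_hosts.items():
        db_host_id = db_hosts.get(vm_name)
        # Skip VMs the db does not know about; only report mismatches.
        if db_host_id is not None and db_host_id != vc_host_id:
            drift[vm_name] = (db_host_id, vc_host_id)
    return drift


# Hypothetical sample data: the db still points at hosts 1 and 2,
# while vCenter reports every VM on host 3.
db_hosts = {"i-2-89": 1, "i-2-90": 2, "i-2-91": 3}
vc_hosts = {"i-2-89": 3, "i-2-90": 3, "i-2-91": 3}

drift = find_host_drift(db_hosts, vc_hosts)
for vm_name, (old_host, new_host) in sorted(drift.items()):
    print(f"{vm_name}: db says host {old_host}, vCenter says host {new_host}")
```

Anything the function returns would be a candidate for a db update, which is exactly the part I would rather see handled by a proper sync than by a script.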

Kelven Yang <kelven.yang@citrix.com> wrote:
This is an issue that we are targeting to solve to sync states between
vCenter/Cloudstack in a controllable way. Please track the status of this
ticket for further progress

https://issues.apache.org/jira/browse/CLOUDSTACK-669


Kelven


On 2/22/13 3:51 PM, "Musayev, Ilya" <imusayev@webmd.net> wrote:

>A bit of an incomplete email, as I was on the train and mistakenly pressed send;
>correction below. Sorry :)
>
>-----Original Message-----
>From: Musayev, Ilya [mailto:imusayev@webmd.net]
>Sent: Friday, February 22, 2013 6:49 PM
>To: cloudstack-dev@incubator.apache.org;
>cloudstack-users@incubator.apache.org
>Cc: Kelven Yang
>Subject: RE: Issues when vCenter becomes unavailable
>
>Summary:
>
>I have 3 hypervisors
>Hypervisor 1 and 2 are down, hypervisor 3 is up. All VMs live on
>hypervisor 3; however, the host_id in the instance table for the VMs is not
>being updated to reflect the only hypervisor alive.
>
>Details:
>
>I physically powered off 2 hypervisors that had most of my VMs and left 1
>online.
>
>The VMs were brought back online by vcenter, however from then on, I
>experience what Dave and Andreas mentioned.
>
>That is, VMware VM instances are bound to a host ID (hypervisor) and not
>to vCenter, and operations executed on the VMs require the
>hypervisor to stay up. If the hypervisor goes offline, while the VMs still
>come up in VC, CS cannot comprehend that these VMs now live on another
>hypervisor.
>
>This is bad for production rollouts, because VMs are bound to a
>hypervisor ID and not to virtual center, and it appears it's not getting
>updated - though I do see in the log that CS is trying to find it.
>
>Did a little more digging; it looks like the host_ids don't get updated
>in mysql for VMs in the instances table. I need to double check on this
>because I totally messed up 2 of my test cloudstack clusters.
>
>Can someone do the following test - if time allows - if not - I can try
>on monday:
>
>1) Pick a hypervisor for a test crash and note 1 vm (I.e. i-2-89)
>2) Navigate to "host" table in mysql and note the host_id for hypervisor
>that is about to be powered off.
>3) In mysql, go to the instances table and note the last_host_id and host_id
>for a VM on the test crash hypervisor.
>4) Power off the hypervisor and let vCenter bring the VM back online
>5) Attempt to launch a console on the VM that was on the crashed hypervisor and
>was powered back on by VC
>6) If it fails - as it did in my case - alter the value of host_id to the
>hypervisor it's now living on (my test is not clean because I've ruined
>the cluster that hosts my console vm and don't have time to work on
>it ATM)
>7) Launch the console again to see if the issue is resolved
>
>I suspect the host_id does not get updated, as I witnessed by
>examining the mysql instance table, but I need to fix my env issues to
>confirm.
>
>Regards
>ilya
>
>
>-----Original Message-----
>From: Chiradeep Vittal [mailto:Chiradeep.Vittal@citrix.com]
>Sent: Friday, February 22, 2013 3:41 PM
>To: cloudstack-users@incubator.apache.org
>Cc: Kelven Yang; CloudStack DeveloperList
>Subject: Re: Issues when vCenter becomes unavailable
>
>CC'ing Kelven to see if he has any ideas.
>
>On 2/22/13 12:22 PM, "Dave Dunaway" <dave.dunaway@gmail.com> wrote:
>
>>If I may suggest also testing a disconnect of a host (hypervisor) from
>>vcenter, so that vcenter and CS can still talk, but vcenter cannot talk
>>to the hosts (hypervisors). CS marks the host as down or failed or
>>whatever.
>>
>>When the host comes back up, vcenter can talk to it just fine and all seems good.
>>That however is not the case (I had this with CS 3.0.5 and vmware esxi
>>5.0)
>>when CS tries to talk to vcenter and the previously disconnected host
>>(that is now recovered).
>>
>>What we experienced was that we had to migrate all guests off the
>>recovered host, and then destroy that host in CS, and re-create it.
>>Then we could migrate back onto it the guests which had been previously
>>migrated.
>>
>>The curious thing is that while CS did not want to send commands to the
>>host (it kept on saying host id=X has timedout when whatever command
>>was sent to it), CS WAS polling the host for resources and getting the
>>correct numbers.... so CS could in some ways talk to the host (ie: it
>>knew the capabilities, number of VMs on it, etc).
>>
>>Luckily for me this all happened in a test environment. In production,
>>this would have been a real nightmare!
>>
>>
>>dave
>>
>>
>>On Fri, Feb 22, 2013 at 2:48 PM, Musayev, Ilya <imusayev@webmd.net>
>>wrote:
>>
>>> Andi
>>>
>>> I'm on CS4.0. I simulated the VMWare VCenter 5 failure by adding a
>>> bogus IP entry in /etc/hosts for 10 minutes for virtual center host.
>>> That in turn made VC unreachable by CS.
>>>
>>> I then began executing commands and sure enough commands failed or
>>> backlogged. Once I restored VC connectivity, the backlogged commands
>>> executed and I did not experience any abnormalities.
>>>
>>> I will redo this test and leave VC off for an hour - maybe I need a
>>> longer outage.
>>>
>>> Regards
>>> ilya
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Musayev, Ilya
>>> Sent: Thursday, February 21, 2013 2:43 PM
>>> To: cloudstack-users@incubator.apache.org
>>> Subject: RE: Issues when vCenter becomes unavailable
>>>
>>> This is definitely not the behavior we want with vcenter.
>>>
>>> I will test this out on my lab setup shortly.
>>>
>>> Thanks
>>> ilya
>>>
>>> -----Original Message-----
>>> From: Chip Childers [mailto:chip.childers@sungard.com]
>>> Sent: Thursday, February 21, 2013 9:40 AM
>>> To: cloudstack-users@incubator.apache.org
>>> Subject: Re: Issues when vCenter becomes unavailable
>>>
>>> On Thu, Feb 21, 2013 at 08:59:14AM -0500, Mathias Mullins wrote:
>>> > Andreas,
>>> >
>>> > The open source community doesn't support the Citrix version 3.0.6.
>>> > You need to report this via your Citrix Support contract. Sounds
>>> > like this could be a bug.
>>> >
>>> > Community - this could be a possible issue in 4.0.0 / 4.0.1. I
>>> > don't know if this test case has been explored.
>>>
>>> Thx - I forwarded to cs-dev@i.a.o to get the test engineers in the
>>> community to take a look.
>>>
>>> >
>>> > Thanks,
>>> > Matt Mullins
>>> > CloudPlatform Implementation Engineer Worldwide Cloud Services
>>> > Citrix System, Inc.
>>> > +1 (407) 920-1107  Office/Cell Phone
>>> > matt.mullins@citrix.com
>>> >
>>> >
>>> >
>>> > On 2/21/13 5:35 AM, "Fuchs, Andreas (SwissTXT)"
>>> > <Andreas.Fuchs@swisstxt.ch> wrote:
>>> >
>>> > >Hi CS Users
>>> > >
>>> > >We are running CS 3.0.6 on a vSphere platform and found a strange
>>> > >behavior.
>>> > >
>>> > >When the vCenter becomes unavailable due to a reboot or some other
>>> > >issue, it seems that CS is shutting down instances when vCenter
>>> > >becomes available again.
>>> > >
>>> > >What we think happens:
>>> > >1. vCenter becomes unreachable
>>> > >2. CS marks the ESX servers as "down"
>>> > >3. We think this leads to: CS marks the instances as down as well
>>> > >4. When vCenter becomes available again, CS stops the "marked as down"
>>> > >instances
>>> > >
>>> > >This is very bad as the instances were running all the time and
>>> > >the shutdown issued by CS is forcing a service interruption.
>>> > >
>>> > >My problem is that I cannot really reproduce it, as a lot of testing
>>> > >is ongoing on the platform at the moment, so my question:
>>> > >
>>> > >Does someone else see this issue as well and can maybe reproduce?
>>> > >Is there a workaround to it, can I change some flag or something
>>> > >which tells CS to never shut down an instance by itself?
>>> > >Why are the ESX hosts getting marked as down and not unreachable
>>> > >or something?
>>> > >
>>> > >Best regards
>>> > >Andi
>>> >
>>> >
>>>
>>>
>>>
>
>
>
>
>


