cloudstack-dev mailing list archives

From David Mabry <dma...@ena.com.INVALID>
Subject Re: CS 4.8 KVM VMs will not live migrate
Date Fri, 02 Feb 2018 03:46:01 GMT
Andrija,

You were right!  The isolation_uri and the broadcast_uri were both blank for the problem
VMs.  Once I corrected the issue, I was able to migrate them inside of CS without issue.
Thanks for helping me get to the root cause of this issue.
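
For anyone checking for the same condition, here is a minimal sketch of a read-only query. It assumes the NIC table is cloud.nics with columns named broadcast_uri, isolation_uri, network_id, instance_id, state and removed; verify these against your own schema first. The function only prints the SQL so it can be reviewed before being piped into the mysql client.

```shell
#!/bin/sh
# Print a read-only query listing NICs whose broadcast_uri or isolation_uri
# is empty. Table and column names are assumptions; check your schema.
nic_check_sql() {
  cat <<'EOF'
SELECT id, instance_id, network_id, broadcast_uri, isolation_uri, state
FROM cloud.nics
WHERE removed IS NULL
  AND (broadcast_uri IS NULL OR broadcast_uri = ''
       OR isolation_uri IS NULL OR isolation_uri = '')
ORDER BY network_id, id;
EOF
}

# Review the output, then run it by hand, for example:
#   nic_check_sql | mysql -u <user> -p
nic_check_sql
```

A row returned here is not necessarily broken; as suggested in the thread, compare it with other NICs in the same network before changing anything.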

Thanks,
David Mabry

On 2/1/18, 3:27 PM, "David Mabry" <dmabry@ena.com.INVALID> wrote:

    Andrija,
    
    Thanks for the tip.  I'll check that out and let you know what I find.
    
    Thanks,
    David Mabry
    On 2/1/18, 2:04 PM, "Andrija Panic" <andrija.panic@gmail.com> wrote:
    
        The customer with serial number here :)
        
        So, another issue which I noticed: when you have KVM host disconnections
        (agent disconnect), then in some cases in the cloud.nics table the
        broadcast_uri, isolation_uri, state or a similar field will be NULL
        instead of having the correct values for the specific NIC of the affected VM.
        
        In this case the VM will not live migrate via ACS (but you can of course
        manually migrate it)...the fix is to fix the NICs table with proper values
        (copy values from other NICs in the same network).
        
        Check if this might be the case...
        
        Cheers
        
        On 31 January 2018 at 15:49, Tutkowski, Mike <Mike.Tutkowski@netapp.com>
        wrote:
        
        > Glad to hear you fixed the issue! :)
        >
        > > On Jan 31, 2018, at 7:16 AM, David Mabry <dmabry@ena.com.INVALID> wrote:
        > >
        > > Mike and Wei,
        > >
        > > Good news!  I was able to manually live migrate these VMs following the
        > > steps outlined below:
        > >
        > > 1.) virsh dumpxml 38 --migratable > 38.xml
        > > 2.) Change the vnc information in 38.xml to match destination host IP
        > >     and available VNC port
        > > 3.) virsh migrate --verbose --live 38 --xml 38.xml qemu+tcp://destination.host.net/system
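
Step 2 of the list above is a manual edit. As a rough sketch, the listen address could be rewritten with sed, assuming the dumped XML carries it in single-quoted listen='...' and address='...' attributes on the <graphics> and <listen> elements. The function name and the example addresses are hypothetical, and the VNC port is not touched here.

```shell
#!/bin/sh
# Rewrite the VNC listen address in a libvirt domain XML read from stdin.
# $1 = address of the source host, $2 = address of the destination host.
# Assumes single-quoted attributes on the <graphics> and <listen> elements.
rewrite_vnc_listen() {
  old_ip="$1"
  new_ip="$2"
  # Escape the dots so sed matches the old address literally.
  old_re=$(printf '%s' "$old_ip" | sed 's/\./\\./g')
  sed -e "/<graphics /s/listen='${old_re}'/listen='${new_ip}'/" \
      -e "/<listen /s/address='${old_re}'/address='${new_ip}'/"
}

# Example (hypothetical addresses and output file name):
#   rewrite_vnc_listen 10.0.0.1 10.0.0.2 < 38.xml > 38-dest.xml
```

Writing to a second file keeps the original dump intact, so the result can be diffed before it is passed to virsh migrate --xml.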
        > >
        > > To my surprise, Cloudstack was able to discover and properly handle the
        > > fact that this VM was live migrated to a new host without issue.  Very cool.
        > >
        > > Wei, I suspect you are correct when you said this was an issue with the
        > > cloudstack agent code.  After digging a little deeper, the agent is never
        > > attempting to talk to libvirt at all after prepping the dxml to send to the
        > > destination host.  I'm going to attempt to reproduce this in my lab and
        > > attach a remote debugger and see if I can get to the bottom of it.
        > >
        > > Thanks again for the help guys!  I really appreciate it.
        > >
        > > Thanks,
        > > David Mabry
        > >
        > > On 1/30/18, 9:55 AM, "David Mabry" <dmabry@ena.com.INVALID> wrote:
        > >
        > >    Ah, understood.  I'll take a closer look at the logs and make sure
        > >    that I didn't accidentally miss those lines when I pulled together the logs
        > >    for this email chain.
        > >
        > >    Thanks,
        > >    David Mabry
        > >    On 1/30/18, 8:34 AM, "Wei ZHOU" <ustcweizhou@gmail.com> wrote:
        > >
        > >        Hi David,
        > >
        > >        I encountered the UnsupportedAnswer once before, when I made some
        > >        changes in the kvm plugin.
        > >
        > >        Normally there should be some network configurations in the
        > >        agent.log but I do not see them.
        > >
        > >        -Wei
        > >
        > >
        > >        2018-01-30 15:00 GMT+01:00 David Mabry <dmabry@ena.com.invalid>:
        > >
        > >> Hi Wei,
        > >>
        > >> I detached the iso and received the same error.  Just out of curiosity,
        > >> what leads you to believe it is something in the vxlan code?  I guess at
        > >> this point, attaching a remote debugger to the agent in question might be
        > >> the best way to get to the bottom of what is going on.
        > >>
        > >> Thanks in advance for the help.  I really, really appreciate it.
        > >>
        > >> Thanks,
        > >> David Mabry
        > >>
        > >> On 1/30/18, 3:30 AM, "Wei ZHOU" <ustcweizhou@gmail.com> wrote:
        > >>
        > >>    The answer should be caused by an exception in the cloudstack agent.
        > >>    I tried to migrate a vm in our testing env, it is working.
        > >>
        > >>    There are some differences between our env and yours.
        > >>    (1) vlan VS vxlan
        > >>    (2) no ISO VS attached ISO
        > >>    (3) both of us use ceph and centos7.
        > >>
        > >>    I suspect it is caused by the vxlan code.
        > >>    However, could you detach the ISO and try again ?
        > >>
        > >>    -Wei
        > >>
        > >>
        > >>
        > >>    2018-01-29 19:48 GMT+01:00 David Mabry <dmabry@ena.com.invalid>:
        > >>
        > >>> Good day Cloudstack Devs,
        > >>>
        > >>> I've run across a real head scratcher.  I have two VMs, (initially 3 VMs,
        > >>> but more on that later) on a single host, that I cannot live migrate to any
        > >>> other host in the same cluster.  We discovered this after attempting to
        > >>> roll out patches going from CentOS 7.2 to CentOS 7.4.  Initially, we
        > >>> thought it had something to do with the new version of libvirtd or qemu-kvm
        > >>> on the other hosts in the cluster preventing these VMs from migrating, but
        > >>> we are able to live migrate other VMs to and from this host without issue.
        > >>> We can even create new VMs on this specific host and live migrate them
        > >>> after creation with no issue.  We've put the migration source agent,
        > >>> migration destination agent and the management server in debug and don't
        > >>> seem to get anything useful other than "Unsupported command".  Luckily, we
        > >>> did have one VM that was shutdown and restarted, this is the 3rd VM
        > >>> mentioned above.  Since that VM has been restarted, it has no issues live
        > >>> migrating to any other host in the cluster.
        > >>>
        > >>> I'm at a loss as to what to try next and I'm hoping that someone out there
        > >>> might have had a similar issue and could shed some light on what to do.
        > >>> Obviously, I can contact the customer and have them shutdown their VMs, but
        > >>> that will potentially just delay this problem to be solved another day.
        > >>> Even if shutting down the VMs is ultimately the solution, I'd still like to
        > >>> understand what happened to cause this issue in the first place with the
        > >>> hopes of preventing it in the future.
        > >>>
        > >>> Here's some information about my setup:
        > >>> Cloudstack 4.8 Advanced Networking
        > >>> CentOS 7.2 and 7.4 Hosts
        > >>> Ceph RBD Primary Storage
        > >>> NFS Secondary Storage
        > >>> Instance in Question for Debug: i-532-1392-NSVLTN
        > >>>
        > >>> I have attached relevant debug logs to this email if anyone wishes to take
        > >>> a look.  I think the most interesting error message that I have received is
        > >>> the following:
        > >>>
        > >>> 468390:2018-01-27 08:59:35,172 DEBUG [c.c.a.t.Request]
        > >>> (Work-Job-Executor-6:ctx-188ea30f job-181792/job-181802 ctx-8e7f45ad)
        > >>> (logid:f0888362) Seq 22-942378222027276319: Received:  { Ans: , MgmtId:
        > >>> 14038012703634, via: 22(csh02c01z01.nsvltn.ena.net), Ver: v1, Flags: 110,
        > >>> { UnsupportedAnswer } }
        > >>> 468391:2018-01-27 08:59:35,172 WARN  [c.c.a.m.AgentManagerImpl]
        > >>> (Work-Job-Executor-6:ctx-188ea30f job-181792/job-181802 ctx-8e7f45ad)
        > >>> (logid:f0888362) Unsupported Command: Unsupported command issued:
        > >>> com.cloud.agent.api.PrepareForMigrationCommand.  Are you sure you got the
        > >>> right type of server?
        > >>> 468392:2018-01-27 08:59:35,179 ERROR [c.c.v.VmWorkJobHandlerProxy]
        > >>> (Work-Job-Executor-6:ctx-188ea30f job-181792/job-181802 ctx-8e7f45ad)
        > >>> (logid:f0888362) Invocation exception, caused by:
        > >>> com.cloud.exception.AgentUnavailableException:
        > >>> Resource [Host:22] is unreachable: Host 22: Unable to prepare for migration
        > >>> due to Unsupported command issued:
        > >>> com.cloud.agent.api.PrepareForMigrationCommand.
        > >>> Are you sure you got the right type of server?
        > >>> 468393:2018-01-27 08:59:35,179 INFO  [c.c.v.VmWorkJobHandlerProxy]
        > >>> (Work-Job-Executor-6:ctx-188ea30f job-181792/job-181802 ctx-8e7f45ad)
        > >>> (logid:f0888362) Rethrow exception
        > >>> com.cloud.exception.AgentUnavailableException:
        > >>> Resource [Host:22] is unreachable: Host 22: Unable to prepare for migration
        > >>> due to Unsupported command issued:
        > >>> com.cloud.agent.api.PrepareForMigrationCommand.
        > >>> Are you sure you got the right type of server?
        > >>>
        > >>> I've tracked this "Unsupported command" down in the CS 4.8 code to
        > >>> cloudstack/api/src/com/cloud/agent/api/Answer.java which is the generic
        > >>> answer class.  I believe where the error is really being spawned from is
        > >>> cloudstack/engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java.
        > >>> Specifically:
        > >>>        Answer pfma = null;
        > >>>        try {
        > >>>            pfma = _agentMgr.send(dstHostId, pfmc);
        > >>>            if (pfma == null || !pfma.getResult()) {
        > >>>                final String details = pfma != null ? pfma.getDetails() : "null answer returned";
        > >>>                final String msg = "Unable to prepare for migration due to " + details;
        > >>>                pfma = null;
        > >>>                throw new AgentUnavailableException(msg, dstHostId);
        > >>>            }
        > >>>
        > >>> The pfma returned must be in error or is never returned and therefore
        > >>> still null.  That answer appears that it should be coming from the
        > >>> destination agent, but for the life of me I can't figure out what the root
        > >>> cause of this error is beyond, "Unsupported command issued".  What command
        > >>> is unsupported?  My guess is that it could be something wrong with the dxml
        > >>> that is generated and passed to the destination host, but I have as yet
        > >>> been unable to catch that dxml in debug.
        > >>>
        > >>> Any help or guidance is greatly appreciated.
        > >>>
        > >>> Thanks,
        > >>> David Mabry
        > >>>
        > >>>
        > >>
        > >>
        > >>
        > >
        > >
        > >
        > >
        >
        
        
        
        -- 
        
        Andrija Panić
        
    
    
