cloudstack-users mailing list archives

From sriharsha work <sriharsha.w...@gmail.com>
Subject Re: Help! After network outage, can't start System VMs; focused debug info attached
Date Tue, 17 Sep 2013 10:03:19 GMT
Does reconfiguring CloudStack to use local storage for system VMs retain
all the VMs that are already in CloudStack? There are about 200 VMs running
in our CloudStack. What about the VM templates, snapshots, and all the
other content that was already in CloudStack?

In other words, would it still restore CloudStack's original behavior as
we had it before our maintenance? Also, what are the disadvantages of
using local storage for system VMs?

Thanks
Sriharsha.

On Tue, Sep 17, 2013 at 2:52 AM, Kirk Kosinski <kirkkosinski@gmail.com> wrote:

> Okay, so system VMs are using NFS primary storage (I misread the OP,
> sorry).  Make sure the KVM hosts can mount and write to:
>
> 10.42.1.101:/srv/nfs/eng/cs-primary
>
> Also check libvirtd.log for errors.
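>
> For example, a minimal sketch of the check (/mnt/nfs-test is just a
> scratch mount point made up for illustration):
>
> mkdir -p /mnt/nfs-test
> mount -t nfs 10.42.1.101:/srv/nfs/eng/cs-primary /mnt/nfs-test
> touch /mnt/nfs-test/write-test && rm /mnt/nfs-test/write-test
> umount /mnt/nfs-test
>
> # then look for recent storage errors from libvirt
> tail -n 200 /var/log/libvirt/libvirtd.log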
>
> If you're not making progress and want to get up and running ASAP, try
> reconfiguring CloudStack to use local storage for system VMs and
> (assuming this works) sort out the NFS primary storage problem later.
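>
> For reference, the knob for this is the global setting
> system.vm.use.local.storage (the zone must also have local storage
> enabled). It can be set in the Global Settings UI, or directly in the
> DB, e.g.:
>
> update `cloud`.`configuration` set value = 'true'
>   where name = 'system.vm.use.local.storage';
>
> Either way, a management server restart is needed for it to take
> effect.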
>
> Best regards,
> Kirk
>
> On 09/17/2013 02:22 AM, sriharsha work wrote:
> > Hi Kirk,
> >
> > Thanks for your reply. This is a blocker for us and is currently
> > affecting all of our work, so any help digging further into the issue
> > is much appreciated. I have a question.
> >
> > 1. What should the mount directory be when mounting the system VM
> > template location [2] on the NFS drive?
> >
> >
> > Below is the error from agent.log on the host. It clearly points to
> > some issue with the libvirt pools. Can you please help me understand
> > whether anything else needs to be addressed to get this resolved?
> >
> >
> > 2013-09-17 02:17:36,736 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-3:null) Request:Seq 14-1592393816:  { Cmd ,
> > MgmtId: 161340856362, via: 14, Ver: v1, Flags: 100111,
> > [{"storage.CreateCommand":{"vo
> >
> lId":9817,"pool":{"id":201,"uuid":"9c6fd9a3-43e5-389a-9594-faecf178b4b9","host":"10.42.1.101","path":"/srv/nfs/eng/cs-primary","port":2049,"type":"NetworkFilesystem"},"diskCharacteristics":{"size":725811200,"tags":[],"type":"ROOT","name":"ROOT-9736","useLocalStorage":false,"recreatable":true,"diskOfferingId":7,"volumeId":9817,"hyperType":"KVM"},"templateUrl":"f23a16e7-b628-429e-83e1-698935588465","wait":0}}]
> > }
> > 2013-09-17 02:17:36,736 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-3:null) Processing command:
> > com.cloud.agent.api.storage.CreateCommand
> > 2013-09-17 02:17:36,779 DEBUG [kvm.resource.LibvirtComputingResource]
> > (agentRequest-Handler-3:null) Failed to create volume:
> > com.cloud.utils.exception.CloudRuntimeException:
> > org.libvirt.LibvirtException: Storage volume not found: no storage vol
> > with matching name 'f23a16e7-b628-429e-83e1-698935588465'
> > 2013-09-17 02:17:36,781 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-3:null) Seq 14-1592393816:  { Ans: , MgmtId:
> > 161340856362, via: 14, Ver: v1, Flags: 110,
> >
> > [{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
> > com.cloud.utils.exception.CloudRuntimeException\nMessage:
> > org.libvirt.LibvirtException: Storage volume not found: no storage vol
> > with matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
> > com.cloud.utils.exception.CloudRuntimeException:
> > org.libvirt.LibvirtException: Storage volume not found: no storage vol
> > with matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
> > com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
> > com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
> > com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
> > com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
> > com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
> > com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
> > com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
> > com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
> > java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
> > 2013-09-17 02:17:36,888 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-4:null) Request:Seq 14-1592393817:  { Cmd ,
> > MgmtId: 161340856362, via: 14, Ver: v1, Flags: 100111,
> > [{"StopCommand":{"isProxy":false,"vmName":"s-9736-VM","wait":0}}] }
> > 2013-09-17 02:17:36,888 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-4:null) Processing command:
> > com.cloud.agent.api.StopCommand
> > 2013-09-17 02:17:36,891 DEBUG [kvm.resource.LibvirtComputingResource]
> > (agentRequest-Handler-4:null) Failed to get dom xml:
> > org.libvirt.LibvirtException: Domain not found: no domain with matching
> > uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
> > 2013-09-17 02:17:36,893 DEBUG [kvm.resource.LibvirtComputingResource]
> > (agentRequest-Handler-4:null) Failed to get dom xml:
> > org.libvirt.LibvirtException: Domain not found: no domain with matching
> > uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
> > 2013-09-17 02:17:36,893 DEBUG [kvm.resource.LibvirtComputingResource]
> > (agentRequest-Handler-4:null) Try to stop the vm at first
> > 2013-09-17 02:17:36,895 DEBUG [kvm.resource.LibvirtComputingResource]
> > (agentRequest-Handler-4:null) Failed to stop VM :s-9736-VM :
> > org.libvirt.LibvirtException: Domain not found: no domain with matching
> > uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
> >         at org.libvirt.ErrorHandler.processError(Unknown Source)
> >         at org.libvirt.Connect.processError(Unknown Source)
> >         at org.libvirt.Connect.domainLookupByUUIDString(Unknown Source)
> >         at org.libvirt.Connect.domainLookupByUUID(Unknown Source)
> >         at
> > com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.stopVM(LibvirtComputingResource.java:4023)
> >         at
> > com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.stopVM(Libvi
> >
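> > In case it helps, here is my best guess at how to see what libvirt
> > itself knows about the pool and the missing volume (uuids taken from
> > the log above):
> >
> > virsh pool-list --all
> > virsh pool-refresh 9c6fd9a3-43e5-389a-9594-faecf178b4b9
> > virsh vol-list 9c6fd9a3-43e5-389a-9594-faecf178b4b9 | grep f23a16e7
> >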
> >
> > Thanks
> > Sriharsha.
> >
> >
> > On Tue, Sep 17, 2013 at 1:41 AM, Kirk Kosinski <kirkkosinski@gmail.com> wrote:
> >
> >     Hi, here is the error:
> >
> >     2013-09-16 15:08:17,168 DEBUG [agent.transport.Request]
> >     (AgentManager-Handler-5:null) Seq 13-931004532: Processing:  { Ans: ,
> >     MgmtId: 161340856362, via: 13, Ver: v1, Flags: 110,
> >
> >     [{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
> >     com.cloud.utils.exception.CloudRuntimeException\nMessage:
> >     org.libvirt.LibvirtException: Storage volume not found: no storage vol
> >     with matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
> >     com.cloud.utils.exception.CloudRuntimeException:
> >     org.libvirt.LibvirtException: Storage volume not found: no storage vol
> >     with matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
> >     com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
> >     com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
> >     com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
> >     com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
> >     com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
> >     com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
> >     com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
> >     com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
> >     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
> >     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
> >     java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
> >
> >     I'm not certain what volume it is complaining about, but I suspect
> >     secondary storage.  Log on to a host (in particular host 13 [1],
> >     since it is confirmed to suffer from the issue) and try to manually
> >     mount the full path of the directory with the system VM template on
> >     the secondary storage NFS share [2].  The idea is to confirm the
> >     share and subdirectories of the share are mountable.  Maybe during
> >     the maintenance some hosts changed IPs and/or the secondary storage
> >     NFS share permissions (or other settings) were messed up.
> >
> >     If the mount doesn't work, fix whatever is causing it.  If it does
> >     work, please collect additional info.  Enable DEBUG logging on the
> >     hosts [3] (if necessary), wait for the error to occur, and upload
> >     the agent.log from the host with the error.  It should have more
> >     details besides the exception shown in the management-server.log.
> >     If you have a lot of hosts and don't want to enable DEBUG logging
> >     on every one, temporarily disable most of them and do it on the
> >     remaining few.
> >
> >     Best regards,
> >     Kirk
> >
> >     [1] "13" is the id of the host in the CloudStack database, so find
> out
> >     which host it is with:
> >     select * from `cloud`.`host` where id = 13 \G
> >
> >     [2] Something like:
> >     nfshost:/share/template/tmpl/2/123
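> >
> >     For instance, a sketch of the manual mount (substitute the real
> >     secondary storage host and template path, and any empty directory
> >     as a mount point):
> >
> >     mkdir -p /mnt/secstorage-test
> >     mount -t nfs nfshost:/share/template/tmpl/2/123 /mnt/secstorage-test
> >     ls -l /mnt/secstorage-test
> >     umount /mnt/secstorage-test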
> >
> >     [3] In /etc/cloudstack/agent/log4j-cloud.xml, set the Threshold for
> >     FILE and com.cloud to DEBUG.  Depending on the CloudStack version,
> >     it may or may not be enabled by default, and the path may be
> >     /etc/cloud/agent/.
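> >
> >     The relevant fragments of log4j-cloud.xml look roughly like this
> >     (a sketch; the exact layout varies by version):
> >
> >     <appender name="FILE" ...>
> >       <param name="Threshold" value="DEBUG"/>
> >     </appender>
> >
> >     <category name="com.cloud">
> >       <priority value="DEBUG"/>
> >     </category>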
> >
> >
> >     On 09/16/2013 07:36 PM, sriharsha work wrote:
> >     > Replying on behalf of Matt. We are able to write data to the NFS
> >     > drives. That's not an issue.
> >     >
> >     > Thanks
> >     > Sriharsha
> >     >
> >     > Sent from my iPhone
> >     >
> >     >> On Sep 16, 2013, at 19:30, Ahmad Emneina <aemneina@gmail.com> wrote:
> >     >>
> >     >> Try to mount your primary storage to a compute host and try to
> >     >> write to it. Your NFS server might not have come back up properly
> >     >> (settings-wise, or with all the relevant services running).
> >     >>> On Sep 16, 2013 6:08 PM, "Matt Foley" <mfoley@hortonworks.com> wrote:
> >     >>>
> >     >>> Thank you, Chiradeep.  Log snippet now available at
> >     >>> http://apaste.info/qBIB
> >     >>> --Matt
> >     >>>
> >     >>> On Mon, Sep 16, 2013 at 5:19 PM, Chiradeep Vittal
> >     >>> <Chiradeep.Vittal@citrix.com> wrote:
> >     >>>
> >     >>>> Attachments are stripped. Can you paste it somewhere (say, at
> >     >>>> http://apaste.info/)?
> >     >>>>
> >     >>>> From: Matt Foley <mfoley@hortonworks.com>
> >     >>>> Date: Monday, September 16, 2013 4:58 PM
> >     >>>>
> >     >>>> We had a planned network outage this weekend, which inadvertently
> >     >>>> resulted in making the NFS Shared Primary Storage (used by System
> >     >>>> VMs) unavailable for a day and a half.  (Guest VMs use local
> >     >>>> storage only, but System VMs use shared storage only.)  CloudStack
> >     >>>> was not brought down prior to the outage.
> >     >>>>
> >     >>>> After the network came back, we gracefully brought down all
> >     >>>> services including cloudstack-management, mysql, and NFS, then
> >     >>>> actually rebooted all servers in the cluster and the NFS server
> >     >>>> (to make sure there were no stale file handles), then brought up
> >     >>>> services in the appropriate order.  We also checked mysql for
> >     >>>> table corruption, and found none.  We confirmed that the NFS
> >     >>>> volumes are mountable from all hosts, and in fact Shared Primary
> >     >>>> Storage is being mounted by CloudStack on the hosts as usual,
> >     >>>> under /mnt/<uuid>.
> >     >>>>
> >     >>>> Nevertheless, when we try to bring up the cluster, we fail to
> >     >>>> start the system VMs, with errors "InsufficientServerCapacityException:
> >     >>>> Unable to create a deployment for VM".  The cause is not really
> >     >>>> insufficient capacity, as actual usage of resources is tiny; these
> >     >>>> error messages are false explanations of the failure to create a
> >     >>>> primary storage volume for the System VMs.
> >     >>>>
> >     >>>> Digging into management-server.log, the core issue seems to be the
> >     >>>> ~160 line snippet from the log attached to this message as
> >     >>>> cloudstack_debug_2013.09.16.log.  The only Shared Primary Storage
> >     >>>> pool is pool 201, named "cs-primary".  It is mounted on all hosts
> >     >>>> as /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9, which is its uuid.
> >     >>>> The log shows the management server correctly identifying a
> >     >>>> particular host as being able to access pool 201, then trying to
> >     >>>> allocate a primary storage volume using the template with uuid
> >     >>>> f23a16e7-b628-429e-83e1-698935588465.  It fails, but I cannot tell
> >     >>>> why.  I suspect its claim that "Template 3 has already been
> >     >>>> downloaded to pool 201" is false, but I don't know how to check
> >     >>>> this (or fix it if wrong).
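> >     >>>>
> >     >>>> (My best guess at a check, assuming template_spool_ref is the
> >     >>>> relevant table, would be something like:
> >     >>>>
> >     >>>> select * from `cloud`.`template_spool_ref`
> >     >>>>   where pool_id = 201 and template_id = 3 \G
> >     >>>>
> >     >>>> and then verifying that the recorded install_path actually exists
> >     >>>> under the pool's mount point -- but I'm not sure this is right.)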
> >     >>>>
> >     >>>> Any guidance for further debugging or fixing this would be
> >     >>>> GREATLY appreciated.
> >     >>>> Thanks,
> >     >>>> --Matt
> >     >>>
> >
> >
> >
> >
> > --
> > Thanks & Regards
> > Sriharsha Devineni
>



-- 
Thanks & Regards
Sriharsha Devineni
