cloudstack-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ian Young <iyo...@ratespecial.com>
Subject Re: console proxy times out
Date Fri, 23 May 2014 17:42:27 GMT
Also, is this normal?  Every time the server is rebooted, it adds another
record to the mshost table but the "removed" field is always NULL.

http://pastebin.com/q5zDCu4b


On Fri, May 23, 2014 at 10:39 AM, Ian Young <iyoung@ratespecial.com> wrote:

> The SSVM is stopped.  If I try to start it, it complains about
> insufficient capacity.  CPU?  RAM?  I have plenty of both available.
>
> 2014-05-23 10:36:51,196 DEBUG [c.c.d.FirstFitPlanner]
> (Job-Executor-2:ctx-ababac38 ctx-0906d3c3) Listing clusters in order of
> aggregate capacity, that have (atleast one host with) enough CPU and RAM
> capacity under this Pod: 1
> 2014-05-23 10:36:51,198 DEBUG [c.c.d.FirstFitPlanner]
> (Job-Executor-2:ctx-ababac38 ctx-0906d3c3) Removing from the clusterId list
> these clusters from avoid set: [1]
> 2014-05-23 10:36:51,198 DEBUG [c.c.d.FirstFitPlanner]
> (Job-Executor-2:ctx-ababac38 ctx-0906d3c3) No clusters found after removing
> disabled clusters and clusters in avoid list, returning.
> 2014-05-23 10:36:51,201 DEBUG [c.c.c.CapacityManagerImpl]
> (Job-Executor-2:ctx-ababac38 ctx-0906d3c3) VM state transitted from
> :Starting to Stopped with event: OperationFailedvm's original host id: 1
> new host id: null host id before state transition: null
> 2014-05-23 10:36:51,201 WARN  [c.c.s.s.SecondaryStorageManagerImpl]
> (Job-Executor-2:ctx-ababac38 ctx-0906d3c3) Exception while trying to start
> secondary storage vm
> com.cloud.exception.InsufficientServerCapacityException: Unable to create
> a deployment for VM[SecondaryStorageVm|s-1-VM]Scope=interface
> com.cloud.dc.DataCenter; id=1
>
>
> On Fri, May 23, 2014 at 10:35 AM, Ian Young <iyoung@ratespecial.com>wrote:
>
>> I rebooted it and now it's in an even more broken state.  It's repeatedly
>> trying to stop the console proxy but can't because its state is "Starting."
>>  Here is an excerpt from the management log:
>>
>> http://pastebin.com/FiaDzKXb
>>
>> The agent log keeps repeating these messages:
>>
>> http://pastebin.com/yDidSbrz
>>
>> What's wrong with it?
>>
>>
>> On Thu, May 22, 2014 at 12:55 PM, Ian Young <iyoung@ratespecial.com>wrote:
>>
>>> I wonder if something is wrong with the NFS mount.  I see this error
>>> periodically in /var/log/messages even though I have set the Domain in
>>> /etc/idmapd.conf to the host's FQDN:
>>>
>>> May 20 19:30:22 virthost1 rpc.idmapd[1790]: nss_getpwnam: name '0' does
>>> not map into domain 'redacted.com'
>>> May 20 19:36:20 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '0' does
>>> not map into domain 'redacted.com'
>>> May 20 19:44:35 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '0' does
>>> not map into domain 'redacted.com'
>>> May 21 10:21:25 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '0' does
>>> not map into domain 'redacted.com'
>>> May 21 12:46:40 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '0' does
>>> not map into domain 'redacted.com'
>>> May 21 13:52:42 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '0' does
>>> not map into domain 'redacted.com'
>>> May 21 13:55:20 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '107'
>>> does not map into domain 'redacted.com'
>>> May 21 20:31:51 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '107'
>>> does not map into domain 'redacted.com'
>>> May 22 10:14:18 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '0' does
>>> not map into domain 'redacted.com'
>>> May 22 10:18:40 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '107'
>>> does not map into domain 'redacted.com'
>>> May 22 10:19:23 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '0' does
>>> not map into domain 'redacted.com'
>>> May 22 10:25:16 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '107'
>>> does not map into domain 'redacted.com'
>>>
>>> name '107' just started appearing in the log yesterday, which looks
>>> unusual.  Up until then, the error was always name '0'.
>>>
>>>
>>> On Thu, May 22, 2014 at 11:15 AM, Andrija Panic <andrija.panic@gmail.com
>>> > wrote:
>>>
>>>> I have observed this kind of problems ("process blocked for more than xx
>>>> sec...") when I had access with storage - check your disks,  smartctl
>>>> etc...
>>>> best
>>>>
>>>> Sent from Google Nexus 4
>>>> On May 22, 2014 7:49 PM, "Ian Young" <iyoung@ratespecial.com> wrote:
>>>>
>>>> > And this is in /var/log/messages right before that event:
>>>> >
>>>> > May 22 10:16:07 virthost1 kernel: INFO: task qemu-kvm:2971 blocked
>>>> for more
>>>> > than 120 seconds.
>>>> > May 22 10:16:07 virthost1 kernel:      Not tainted
>>>> > 2.6.32-431.11.2.el6.x86_64 #1
>>>> > May 22 10:16:07 virthost1 kernel: "echo 0 >
>>>> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> > May 22 10:16:07 virthost1 kernel: qemu-kvm      D 0000000000000002
>>>>   0
>>>> >  2971      1 0x00000080
>>>> > May 22 10:16:07 virthost1 kernel: ffff8810724e9be8 0000000000000082
>>>> > 0000000000000000 ffff88106b6529d8
>>>> > May 22 10:16:07 virthost1 kernel: ffff880871b3e8d8 ffff880871b3e8f0
>>>> > ffffffff8100bb8e ffff8810724e9be8
>>>> > May 22 10:16:07 virthost1 kernel: ffff881073525058 ffff8810724e9fd8
>>>> > 000000000000fbc8 ffff881073525058
>>>> > May 22 10:16:07 virthost1 kernel: Call Trace:
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8100bb8e>] ?
>>>> > apic_timer_interrupt+0xe/0x20
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff810555ef>] ?
>>>> > mutex_spin_on_owner+0x9f/0xc0
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8152969e>]
>>>> > __mutex_lock_slowpath+0x13e/0x180
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8152953b>]
>>>> mutex_lock+0x2b/0x50
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffffa021c2cf>]
>>>> > memory_access_ok+0x7f/0xc0 [vhost_net]
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffffa021d89c>]
>>>> > vhost_dev_ioctl+0x2ec/0xa50 [vhost_net]
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffffa021c411>] ?
>>>> > vhost_work_flush+0xe1/0x120 [vhost_net]
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8122db91>] ?
>>>> > avc_has_perm+0x71/0x90
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffffa021f11a>]
>>>> > vhost_net_ioctl+0x7a/0x5d0 [vhost_net]
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8122f914>] ?
>>>> > inode_has_perm+0x54/0xa0
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffffa01a28b7>] ?
>>>> > kvm_vcpu_ioctl+0x1e7/0x580 [kvm]
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8108b14e>] ?
>>>> > send_signal+0x3e/0x90
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8119dc12>]
>>>> vfs_ioctl+0x22/0xa0
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8119ddb4>]
>>>> > do_vfs_ioctl+0x84/0x580
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8119e331>]
>>>> sys_ioctl+0x81/0xa0
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff810e1e4e>] ?
>>>> > __audit_syscall_exit+0x25e/0x290
>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8100b072>]
>>>> > system_call_fastpath+0x16/0x1b
>>>> >
>>>> >
>>>> > On Thu, May 22, 2014 at 10:39 AM, Ian Young <iyoung@ratespecial.com>
>>>> > wrote:
>>>> >
>>>> > > The console proxy became unavailable again yesterday afternoon.
 I
>>>> could
>>>> > > SSH into it via its link local address and nothing seemed to be
>>>> wrong
>>>> > > inside the VM itself.  However, the qemu-kvm process for that VM
>>>> was at
>>>> > > almost 100% CPU.  Inside the VM, the CPU usage was minimal and
the
>>>> java
>>>> > > process was running and listening on port 443.  So there seems
to be
>>>> > > something wrong with it down at the KVM/QEMU level.  It's weird
how
>>>> this
>>>> > > keeps happening to the console proxy only and not any of the other
>>>> VMs.
>>>> >  I
>>>> > > tried to reboot it from the management UI and after about 15
>>>> minutes, it
>>>> > > finally did.  Now the console proxy is working but I don't know
how
>>>> long
>>>> > it
>>>> > > will last before it breaks again.  I found this in libvirtd.log,
>>>> which
>>>> > > corresponds with the time the console proxy rebooted:
>>>> > >
>>>> > > 2014-05-22 17:17:04.362+0000: 25195: info : libvirt version: 0.10.2,
>>>> > > package: 29.el6_5.7 (CentOS BuildSystem <http://bugs.centos.org>,
>>>> > > 2014-04-07-07:42:04, c6b9.bsys.dev.centos.org)
>>>> > > 2014-05-22 17:17:04.362+0000: 25195: error : qemuMonitorIO:614
:
>>>> internal
>>>> > > error End of file from monitor
>>>> > >
>>>> > >
>>>> > > On Wed, May 21, 2014 at 2:07 PM, Ian Young <iyoung@ratespecial.com>
>>>> > wrote:
>>>> > >
>>>> > >> I built and installed a libvirt 1.04 package from the Fedora
src
>>>> rpm.
>>>> >  It
>>>> > >> installed fine inside a test VM but installing it on the real
>>>> hypervisor
>>>> > >> was a bad idea and I doubt I'll be pursuing it further.  All
VMs
>>>> > promptly
>>>> > >> stopped and this appeared in libvirtd.log:
>>>> > >>
>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: info : libvirt version:
1.0.4,
>>>> > >> package: 1.el6 (Unknown, 2014-05-21-11:36:09, redacted.com)
>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning :
>>>> virDriverLoadModule:72 :
>>>> > >> Module
>>>> /usr/lib64/libvirt/connection-driver/libvirt_driver_network.so
>>>> > not
>>>> > >> accessible
>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning :
>>>> virDriverLoadModule:72 :
>>>> > >> Module
>>>> /usr/lib64/libvirt/connection-driver/libvirt_driver_storage.so
>>>> > not
>>>> > >> accessible
>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning :
>>>> virDriverLoadModule:72 :
>>>> > >> Module
>>>> /usr/lib64/libvirt/connection-driver/libvirt_driver_nodedev.so
>>>> > not
>>>> > >> accessible
>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning :
>>>> virDriverLoadModule:72 :
>>>> > >> Module
>>>> /usr/lib64/libvirt/connection-driver/libvirt_driver_secret.so not
>>>> > >> accessible
>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning :
>>>> virDriverLoadModule:72 :
>>>> > >> Module
>>>> /usr/lib64/libvirt/connection-driver/libvirt_driver_nwfilter.so
>>>> > not
>>>> > >> accessible
>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning :
>>>> virDriverLoadModule:72 :
>>>> > >> Module
>>>> /usr/lib64/libvirt/connection-driver/libvirt_driver_interface.so
>>>> > not
>>>> > >> accessible
>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning :
>>>> virDriverLoadModule:72 :
>>>> > >> Module /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so
>>>> not
>>>> > >> accessible
>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning :
>>>> virDriverLoadModule:72 :
>>>> > >> Module /usr/lib64/libvirt/connection-driver/libvirt_driver_lxc.so
>>>> not
>>>> > >> accessible
>>>> > >> 2014-05-21 20:36:49.471+0000: 23570: error : do_open:1220 :
no
>>>> > connection
>>>> > >> driver available for qemu:///system
>>>> > >> 2014-05-21 20:36:49.472+0000: 23567: error :
>>>> virNetSocketReadWire:1370 :
>>>> > >> End of file while reading data: Input/output error
>>>> > >> 2014-05-21 20:36:49.473+0000: 23571: error : do_open:1220 :
no
>>>> > connection
>>>> > >> driver available for lxc:///
>>>> > >> 2014-05-21 20:36:49.474+0000: 23567: error :
>>>> virNetSocketReadWire:1370 :
>>>> > >> End of file while reading data: Input/output error
>>>> > >> 2014-05-21 20:36:49.475+0000: 23568: error : do_open:1220 :
no
>>>> > connection
>>>> > >> driver available for qemu:///system
>>>> > >> 2014-05-21 20:36:49.476+0000: 23567: error :
>>>> virNetSocketReadWire:1370 :
>>>> > >> End of file while reading data: Input/output error
>>>> > >> 2014-05-21 20:36:49.678+0000: 23575: error : do_open:1220 :
no
>>>> > connection
>>>> > >> driver available for qemu:///system
>>>> > >> 2014-05-21 20:36:49.678+0000: 23567: error :
>>>> virNetSocketReadWire:1370 :
>>>> > >> End of file while reading data: Input/output error
>>>> > >> 2014-05-21 20:36:49.681+0000: 23572: error : do_open:1220 :
no
>>>> > connection
>>>> > >> driver available for qemu:///system
>>>> > >> 2014-05-21 20:36:49.682+0000: 23567: error :
>>>> virNetSocketReadWire:1370 :
>>>> > >> End of file while reading data: Input/output error
>>>> > >>
>>>> > >>
>>>> > >> On Wed, May 21, 2014 at 10:45 AM, Ian Young <
>>>> iyoung@ratespecial.com
>>>> > >wrote:
>>>> > >>
>>>> > >>> I was able to get it working by following these steps:
>>>> > >>>
>>>> > >>> 1. stop all instances
>>>> > >>> 2. service cloudstack-management stop
>>>> > >>> 3. service cloudstack-agent stop
>>>> > >>> 4. virsh shutdown {domain} (for each of the system VMs)
>>>> > >>> 5. service libvirtd stop
>>>> > >>> 6. umount primary and secondary
>>>> > >>> 7. reboot
>>>> > >>>
>>>> > >>> The console proxy is working again.  I expect it will probably
>>>> break
>>>> > >>> again in a day or two.  I have a feeling it's a result
of this
>>>> libvirtd
>>>> > >>> bug, since I've seen the "cannot acquire state change lock"
>>>> several
>>>> > times.
>>>> > >>>
>>>> > >>> https://bugs.launchpad.net/nova/+bug/1254872
>>>> > >>>
>>>> > >>> I might try building my own libvirtd 1.0.3 for EL6.
>>>> > >>>
>>>> > >>>
>>>> > >>> On Tue, May 20, 2014 at 6:21 PM, Ian Young <
>>>> iyoung@ratespecial.com
>>>> > >wrote:
>>>> > >>>
>>>> > >>>> So I got the console proxy working via HTTPS (by managing
my own
>>>> "
>>>> > >>>> realhostip.com" DNS) last week and everything was working
fine.
>>>> > >>>>  Today, all of a sudden, the console proxy stopped
working
>>>> again.  The
>>>> > >>>> browser says, "Connecting to 192-168-100-159.realhostip.com..."
>>>> and
>>>> > >>>> eventually times out.  I tried to restart it and it
went into a
>>>> > "Stopping"
>>>> > >>>> state that never completed and the Agent State was
>>>> "Disconnected."  I
>>>> > could
>>>> > >>>> not shut down the VM using virsh or with "kill -9"
because
>>>> libvirtd
>>>> > kept
>>>> > >>>> saying, "cannot acquire state change lock," so I gracefully
shut
>>>> down
>>>> > the
>>>> > >>>> remaining instances and rebooted the entire management
>>>> > server/hypervisor.
>>>> > >>>>  Start over.
>>>> > >>>>
>>>> > >>>> When it came back up, the SSVM and console proxy started
but the
>>>> > >>>> virtual router was stopped.  I was able to manually
start it
>>>> from the
>>>> > UI.
>>>> > >>>>  The console proxy still times out when I try to access
it from a
>>>> > browser.
>>>> > >>>>  I don't see any errors in the management or agent
logs, just
>>>> this:
>>>> > >>>>
>>>> > >>>> 2014-05-20 18:04:27,632 DEBUG [c.c.a.t.Request]
>>>> > (catalina-exec-10:null)
>>>> > >>>> Seq 1-2130378876: Sending  { Cmd , MgmtId: 55157049428734,
via:
>>>> 1(
>>>> > >>>> virthost1.redacted.com), Ver: v1, Flags: 100011,
>>>> > >>>>
>>>> >
>>>> [{"com.cloud.agent.api.GetVncPortCommand":{"id":4,"name":"r-4-VM","wait":0}}]
>>>> > >>>> }
>>>> > >>>> 2014-05-20 18:04:27,684 DEBUG [c.c.a.t.Request]
>>>> > >>>> (AgentManager-Handler-3:null) Seq 1-2130378876: Processing:
 {
>>>> Ans: ,
>>>> > >>>> MgmtId: 55157049428734, via: 1, Ver: v1, Flags: 10,
>>>> > >>>>
>>>> >
>>>> [{"com.cloud.agent.api.GetVncPortAnswer":{"address":"192.168.100.6","port":5902,"result":true,"wait":0}}]
>>>> > >>>> }
>>>> > >>>> 2014-05-20 18:04:27,684 DEBUG [c.c.a.t.Request]
>>>> > (catalina-exec-10:null)
>>>> > >>>> Seq 1-2130378876: Received:  { Ans: , MgmtId: 55157049428734,
>>>> via: 1,
>>>> > Ver:
>>>> > >>>> v1, Flags: 10, { GetVncPortAnswer } }
>>>> > >>>> 2014-05-20 18:04:27,684 DEBUG [c.c.s.ConsoleProxyServlet]
>>>> > >>>> (catalina-exec-10:null) Port info 192.168.100.6
>>>> > >>>> 2014-05-20 18:04:27,684 INFO  [c.c.s.ConsoleProxyServlet]
>>>> > >>>> (catalina-exec-10:null) Parse host info returned from
executing
>>>> > >>>> GetVNCPortCommand. host info: 192.168.100.6
>>>> > >>>> 2014-05-20 18:04:27,686 DEBUG [c.c.s.ConsoleProxyServlet]
>>>> > >>>> (catalina-exec-10:null) Compose console url:
>>>> > >>>>
>>>> >
>>>> https://192-168-100-159.realhostip.com/ajax?token=CsPhU4m_R2ZoLIdXOtjo3y3humnQN20wt5fSPjbZOHtRh7nli7tiq0ZiWUuwCVIn7VwF6503ziEqCRejlRsVcsyQcUfemTRXlhAOpJUyRugyCuTjmbUIX3EY1cHnFMKwF8FXXZr_PgwyXGPEoOHhkdRgsyRiczbk_Unuh4KmRngATr0FPCLtqhwIMpnbLSYwpnFDz65k9lEJmK6IlXYKVpWXg2rpVEsQvaNlulrZdhMQ7qUbacn82EG43OY8nmwm1SYB8TrUFH5Btb1RHpJm9A
>>>> > >>>> 2014-05-20 18:04:27,686 DEBUG [c.c.s.ConsoleProxyServlet]
>>>> > >>>> (catalina-exec-10:null) the console url is ::
>>>> > >>>> <html><title>r-4-VM</title><frameset><frame
src="
>>>> > >>>>
>>>> >
>>>> https://192-168-100-159.realhostip.com/ajax?token=CsPhU4m_R2ZoLIdXOtjo3y3humnQN20wt5fSPjbZOHtRh7nli7tiq0ZiWUuwCVIn7VwF6503ziEqCRejlRsVcsyQcUfemTRXlhAOpJUyRugyCuTjmbUIX3EY1cHnFMKwF8FXXZr_PgwyXGPEoOHhkdRgsyRiczbk_Unuh4KmRngATr0FPCLtqhwIMpnbLSYwpnFDz65k9lEJmK6IlXYKVpWXg2rpVEsQvaNlulrZdhMQ7qUbacn82EG43OY8nmwm1SYB8TrUFH5Btb1RHpJm9A
>>>> > >>>> "></frame></frameset></html>
>>>> > >>>> 2014-05-20 18:04:29,216 DEBUG [c.c.a.m.AgentManagerImpl]
>>>> > >>>> (AgentManager-Handler-4:null) SeqA 2-545: Processing
Seq 2-545:
>>>>  {
>>>> > Cmd ,
>>>> > >>>> MgmtId: -1, via: 2, Ver: v1, Flags: 11,
>>>> > >>>>
>>>> >
>>>> [{"com.cloud.agent.api.ConsoleProxyLoadReportCommand":{"_proxyVmId":2,"_loadInfo":"{\n
>>>> > >>>>  \"connections\": []\n}","wait":0}}] }
>>>> > >>>>
>>>> > >>>> If I try to restart the system VMs with cloudstack-sysvmadm,
it
>>>> says:
>>>> > >>>>
>>>> > >>>> Stopping and starting 1 secondary storage vm(s)...
>>>> > >>>> curl: (7) couldn't connect to host
>>>> > >>>> ERROR: Failed to stop secondary storage vm with id
1
>>>> > >>>>
>>>> > >>>> Done stopping and starting secondary storage vm(s)
>>>> > >>>>
>>>> > >>>> Stopping and starting 1 console proxy vm(s)...
>>>> > >>>> curl: (7) couldn't connect to host
>>>> > >>>> ERROR: Failed to stop console proxy vm with id 2
>>>> > >>>>
>>>> > >>>> Done stopping and starting console proxy vm(s) .
>>>> > >>>>
>>>> > >>>> Stopping and starting 1 running routing vm(s)...
>>>> > >>>> curl: (7) couldn't connect to host
>>>> > >>>> 2
>>>> > >>>> Done restarting router(s).
>>>> > >>>>
>>>> > >>>> I notice there are now four entries for the same management
>>>> server in
>>>> > >>>> the mshost table, and they all are in an "Up" state
and the
>>>> "removed"
>>>> > field
>>>> > >>>> is NULL.  What's wrong with this system?
>>>> > >>>>
>>>> > >>>
>>>> > >>>
>>>> > >>
>>>> > >
>>>> >
>>>>
>>>
>>>
>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message