cloudstack-dev mailing list archives

From Syahrul Sazli Shaharir <sa...@nocser.net>
Subject Re: patchviasocket seems to be broken with qemu 2.3(+?)
Date Tue, 27 Dec 2016 01:42:48 GMT
Hi,

Update: after a reboot of all hosts during the weekend (resulting in 
reboot of all VMs), the problematic router VM is OK now. Not sure what 
had caused it.

Thanks.

On 2016-12-22 14:03, Syahrul Sazli Shaharir wrote:
> On 2016-12-21 23:26, Linas Žilinskas wrote:
>> At this point I'm not sure what the issue for you could be. Did you
>> try recreating the failing vrouter?
> 
> Yes, multiple times by destroying it and/or restarting the network -
> failed every time.
> 
>> Also, just in case, check if there's free disk space on it. We had
>> some vrouters stuck due to this, and i saw another thread here
>> discussing it.
> 
> Plenty of space in the stuck VM:-
> 
> root@r-691-VM:~# df -h
> Filesystem                                              Size  Used Avail Use% Mounted on
> rootfs                                                  461M  157M  281M  36% /
> udev                                                     10M     0   10M   0% /dev
> tmpfs                                                    50M  236K   50M   1% /run
> /dev/disk/by-uuid/6a0427bc-6052-48de-a4b8-c82d8217ed1d  461M  157M  281M  36% /
> tmpfs                                                   5.0M     0  5.0M   0% /run/lock
> tmpfs                                                   207M     0  207M   0% /run/shm
> /dev/vda1                                                73M   23M   47M  33% /boot
> /dev/vda6                                                92M  5.6M   81M   7% /home
> /dev/vda8                                               184M  6.2M  169M   4% /opt
> /dev/vda11                                               92M  5.6M   81M   7% /tmp
> /dev/vda7                                               751M  493M  219M  70% /usr
> /dev/vda9                                               563M  157M  377M  30% /var
> /dev/vda10                                              184M  7.2M  168M   5% /var/log
> 
> Thanks.
> 
>> 
>> Basically the /var/log/ partition fills up, since it's relatively
>> small. And if you had issues for a period of time with that specific
>> router and restarted it multiple times, the log partition might be
>> full.
>> 
>> On 21/12/16 06:35, Syahrul Sazli Shaharir wrote:
>> 
>>> On 2016-12-20 17:53, Wei ZHOU wrote:
>>> 
>>>> Hi Syahrul,
>>>> 
>>>> Could you upload the /var/log/cloud.log ?
>>> 
>>> Sure:-
>>> 
>>> Working router VM: http://pastebin.com/hwwk86ve
>>> 
>>> Non-working router VM: http://pastebin.com/G4nv09ab
>>> 
>>> Thanks.
>>> 
>>> -Wei
>>> 
>>> 2016-12-20 3:18 GMT+01:00 Syahrul Sazli Shaharir <sazli@nocser.net>:
>>> 
>>> 
>>> On 2016-12-19 18:10, Syahrul Sazli Shaharir wrote:
>>> 
>>> On 2016-12-19 17:03, Linas Žilinskas wrote:
>>> 
>>> From the logs it doesn't seem that the script times out. "Execution is
>>> successful", so it manages to pass the data over the socket.
>>> 
>>> I guess the systemvm just doesn't configure itself for some reason.
>>> 
>>> You are right, I was able to enter the router VM console at some point
>>> during the timeout loops, and was able to capture syslog output during
>>> the loop:-
>>> 
>>> http://pastebin.com/n37aHeSa
>> 
>> I restarted another network, and that network's router VM was able to be
>> recreated, even on the same host as the failed network (both networks have
>> exactly the same configuration; only the VLAN & subnet differ). Comparing
>> the two syslog outputs during boot shows that the problematic network's
>> router VM got stuck during self-configuration at vm_dhcp_entry.json.
>> 
>> 
>> 1. Working network router VM : http://pastebin.com/Y6zpDa6M
>> 2. Non-working network router VM : http://pastebin.com/jzfGMGQB
>> 
>> Thanks.
>> 
>>>> Also, in my personal tests, I noticed some different behaviour with
>>>> different kernels. I don't remember the specifics right now, but on some
>>>> combinations (qemu / kernel) the socket acted differently. For example,
>>>> the data was sent over the socket but wasn't visible inside the VM.
>>>> Other times the socket would be stuck from the host side.
>>>> 
>>>> So I would suggest testing different kernels (3.x, 4.4.x, 4.8.x) or
>>>> trying to log in to the system VM and see what's happening from inside.
>>> 
>>> Will do this next and report the results here.
>>> 
>>> Thanks for your help! :)
>>> 
>>> On 12/16/16 03:46, Syahrul Sazli Shaharir wrote:
>>> 
>>> On 2016-12-16 11:27, Syahrul Sazli Shaharir wrote:
>>> On Wed, 26 Oct 2016, Linas Žilinskas wrote:
>>> 
>>> So after some investigation I've found out that qemu 2.3.0 is indeed
>>> broken, at least in the way CS uses the qemu chardev/socket.
>>> 
>>> Not sure in which specific version it happened, but it was fixed in
>>> 2.4.0-rc3, with the commit specifically noting that CloudStack 4.2 was
>>> not working.
>>> 
>>> qemu git commit: 4bf1cb03fbc43b0055af60d4ff093d6894aa4338
>>> 
>>> Also attaching the patch from that commit.
>>> 
>>> For our own purposes I've included the patch in the qemu-kvm-ev
>>> package (2.3.0) and all is well.
>>> 
>>> Hi,
>>> 
>>> I am facing the exact same issue on the latest CloudStack 4.9.0.1, on
>>> the latest CentOS 7.3.1611, with the latest qemu-kvm-ev-2.6.0-27.1.el7
>>> package.
>>> 
>>> The issue initially surfaced following a heartbeat-induced reset of all
>>> hosts, when it was on CS 4.8 @ CentOS 7.0 and stock qemu-kvm-1.5.3.
>>> Since then, the patchviasocket.pl/py timeouts have persisted for 1 out
>>> of 4 router VMs/networks, even after upgrading to the latest code. (I
>>> have checked the qemu-kvm-ev-2.6.0-27.1.el7 source, and the patched code
>>> is pretty much still intact, as per the 2.4.0-rc3 commit.)
>>> 
>>> Any help would be greatly appreciated.
>>> 
>>> Thanks.
>>> 
>>> (Attached are some debug logs from the host's agent.log)
>>> 
>>> Here are the debug logs as mentioned: http://pastebin.com/yHdsMNzZ
>>> 
>>> Thanks.
>>> 
>>> --sazli
>>> 
>>> On 2016-10-20 09:59, Linas Žilinskas wrote:
>>> 
>>> Hi.
>>> 
>>> We have made an upgrade to 4.9.
>>> 
>>> Custom-built packages with our own patches, which in my mind (I'm the
>>> only one patching those) should not affect the issue I'll describe.
>>> 
>>> I'm not sure whether we didn't notice it before, or it's actually
>>> related to something in 4.9.
>>> 
>>> Basically our system VMs were unable to be patched via the qemu socket.
>>> The script simply errored out with a timeout while trying to push the
>>> data to the socket.
>>> 
>>> Executing it manually (with the cmd line from the logs) resulted in the
>>> same. I even tried the old Perl variant, which also gave the same
>>> result.
>>> 
>>> So finally we found out that this issue happens only on our HVs which
>>> run qemu 2.3.0, from the CentOS 7 special interest virtualization repo.
>>> Other ones that run qemu 1.5, from the official repos, can patch the
>>> system VMs fine.
>>> 
>>> So I'm wondering if anyone has tested 4.9 with KVM and qemu >= 2.x?
>>> Maybe it's something else special in our setup, e.g. we're running the
>>> HVs from a preconfigured netboot image (PXE), but all of them are,
>>> including those with qemu 1.5, so I have no idea.
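The version picture reported across this thread (qemu 1.5.x works, 2.3.0 is broken, the fix landed in 2.4.0-rc3, and 2.6.0 packages carry the fix) can be encoded in a small triage helper. This is only a heuristic derived from the versions mentioned here, not an authoritative affected-range list; it ignores -rc and package suffixes, so pre-release 2.4.0 builds before rc3 would be misreported as fixed.

```python
def qemu_chardev_bug_suspected(version: str) -> bool:
    """Return True if the qemu version falls in the range this thread
    reports as broken for patchviasocket (>= 2.3.0 and < 2.4.0).

    Heuristic only: keeps just the leading x.y.z part, so rc and
    package suffixes (e.g. "2.6.0-27.1.el7") are ignored.
    """
    # "2.6.0-27.1.el7" -> "2.6.0" -> (2, 6, 0)
    nums = tuple(int(p) for p in version.split("-")[0].split(".")[:3])
    return (2, 3, 0) <= nums < (2, 4, 0)
```

Per the thread, hosts flagged by a check like this would need the chardev patch from qemu commit 4bf1cb03fbc43b0055af60d4ff093d6894aa4338 backported, as Linas did for qemu-kvm-ev 2.3.0.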
>>> 
>>> Linas Žilinskas
>>> Head of Development
>>> 
>>> website <http://www.host1plus.com/> [1] | facebook
>>> <https://www.facebook.com/Host1Plus> [2] | twitter
>>> <https://twitter.com/Host1Plus> [3] | linkedin
>>> <https://www.linkedin.com/company/digital-energy-technologies-ltd.> [4]
>>> 
>>> Host1Plus is a division of Digital Energy Technologies Ltd.
>>> 
>>> 26 York Street, London W1U 6PZ, United Kingdom
>>  --
>> --sazli
>> 
>> Linas Žilinskas
>> Head of Development
>> 
>> Links:
>> ------
>> [1] http://www.host1plus.com/
>> [2] https://www.facebook.com/Host1Plus
>> [3] https://twitter.com/Host1Plus
>> [4] https://www.linkedin.com/company/digital-energy-technologies-ltd.

-- 
--sazli
