From: Milamber <milamber@apache.org>
Organization: Apache Software Foundation
Date: Mon, 20 Jul 2015 16:16:36 +0100
To: users@cloudstack.apache.org
Subject: Re: HA feature - KVM - CloudStack 4.5.1
Message-ID: <55AD10D4.1090908@apache.org>

On 20/07/2015 15:44, Luciano Castro wrote:
> Hi!
>
> My test today: I stopped another instance and changed it to an HA
> offering, then started the instance.
>
> Afterwards, I gracefully shut down its KVM host.

Why a graceful shutdown of the KVM host? The HA process (re)starts the
HA-enabled VMs on a new host when the current host has crashed or is
otherwise unavailable, i.e. its CloudStack agent no longer responds. If
you stop the cloudstack-agent gently, the CS management server does not
consider that a crash, so HA will not start. What behavior do you
expect?
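If you want to see HA actually trigger, make the failure look real to
the management server: kill the host or its agent connectivity without
any clean shutdown. A rough sketch (lab host only; the sysrq trick and
the default agent port 8250 are my assumptions, verify them against
your setup):

# Option 1: hard-crash the KVM host so nothing shuts down cleanly
# (requires kernel.sysrq to be enabled):
echo c > /proc/sysrq-trigger

# Option 2: silently drop agent <-> management server traffic
# (8250 is the default port; check the "port" global setting):
iptables -A OUTPUT -p tcp --dport 8250 -j DROP

# Then watch the management server investigate the host:
tail -f /var/log/cloudstack/management/management-server.log \
    | grep -i -E 'investigator|HA-Worker'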
> and I checked the investigator processes:
>
> [root@1q2 ~]# grep -i Investigator /var/log/cloudstack/management/management-server.log
>
> [root@1q2 ~]# date
> Mon Jul 20 14:39:43 UTC 2015
>
> [root@1q2 ~]# ls -ltrh /var/log/cloudstack/management/management-server.log
> -rw-rw-r--. 1 cloud cloud 14M Jul 20 14:39
> /var/log/cloudstack/management/management-server.log
>
> Nothing. I don't know how these processes work internally, but it
> seems they are not working well, agree?
>
> option                       value
> ha.investigators.exclude     (empty)
> ha.investigators.order       SimpleInvestigator,XenServerInvestigator,KVMInvestigator,HypervInvestigator,VMwareInvestigator,PingInvestigator,ManagementIPSysVMInvestigator
> investigate.retry.interval   60
>
> Is there a way to check whether these processes are running?
>
> [root@1q2 ~]# ps waux | grep -i java
> root     11408  0.0  0.0 103252    880 pts/0 S+ 14:44 0:00 grep -i java
> cloud    24225  0.7  1.7 16982036 876412 ?   Sl  Jul16 43:48
> /usr/lib/jvm/jre-1.7.0/bin/java -Djava.awt.headless=true
> -Dcom.sun.management.jmxremote=false -Xmx2g -XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/var/log/cloudstack/management/ -XX:PermSize=512M
> -XX:MaxPermSize=800m
> -Djava.security.properties=/etc/cloudstack/management/java.security.ciphers
> -classpath
> :::/etc/cloudstack/management:/usr/share/cloudstack-management/setup:/usr/share/cloudstack-management/bin/bootstrap.jar:/usr/share/cloudstack-management/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar
> -Dcatalina.base=/usr/share/cloudstack-management
> -Dcatalina.home=/usr/share/cloudstack-management -Djava.endorsed.dirs=
> -Djava.io.tmpdir=/usr/share/cloudstack-management/temp
> -Djava.util.logging.config.file=/usr/share/cloudstack-management/conf/logging.properties
> -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
> org.apache.catalina.startup.Bootstrap start
>
> Thanks
>
> On Sat, Jul 18, 2015 at 1:53 PM, Milamber <milamber@apache.org> wrote:
>
>> On 17/07/2015 22:26, Somesh Naidu wrote:
>>
>>>> Perhaps the management server didn't recognize host 3 as totally
>>>> down (ping still alive? or some quorum not OK?)
>>>> The only way for the mgmt server to fully accept that host 3 has a
>>>> real problem is that host 3 has been rebooted (around 12:44)?
>>>>
>>> The host disconnect was triggered at 12:19 on host 3. The mgmt server
>>> was pretty sure the host was down (it was a graceful shutdown, I
>>> believe), which is why it triggered a disconnect and notified other
>>> nodes. There was no checkhealth/checkonhost/etc. triggered; just the
>>> agent disconnect, with all listeners (ping/etc.) notified.
>>>
>>> At that point the mgmt server should have scheduled HA on all VMs
>>> running on that host. The HA investigators would then work their way
>>> through, identifying whether the VMs are still running, whether they
>>> need to be fenced, etc. But this never happened.
>>
>> AFAIK, stopping the cloudstack-agent service does not start the HA
>> process for the VMs hosted on that node. It seems normal to me that
>> the HA process does not start at that moment.
>> If I wanted to start the HA process on a node, I would go to the Web
>> UI (or cloudmonkey) and change the state of the host from Up to
>> Maintenance.
>>
>> (Afterwards I can stop the cloudstack-agent service if I need to, for
>> example to reboot the node.)
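For reference, with cloudmonkey that looks roughly like this (a sketch;
fill in your own host UUID, and verify the command names against your
cloudmonkey version):

# find the host, then put it into maintenance:
cloudmonkey list hosts type=Routing filter=id,name,state
cloudmonkey prepare hostformaintenance id=<host-uuid>
# CloudStack then live-migrates the VMs off the host; afterwards:
cloudmonkey cancel hostmaintenance id=<host-uuid>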
>>> Regards,
>>> Somesh
>>>
>>> -----Original Message-----
>>> From: Milamber [mailto:milamber@apache.org]
>>> Sent: Friday, July 17, 2015 6:01 PM
>>> To: users@cloudstack.apache.org
>>> Subject: Re: HA feature - KVM - CloudStack 4.5.1
>>>
>>> On 17/07/2015 21:23, Somesh Naidu wrote:
>>>
>>>> Ok, so here are my findings.
>>>>
>>>> 1. Host ID 3 was shut down around 2015-07-16 12:19:09, at which
>>>> point the management server called a disconnect.
>>>> 2. Based on the logs, it seems VM IDs 32, 18, 39 and 46 were running
>>>> on the host.
>>>> 3. No HA tasks for any of these VMs at this time.
>>>> 4. Management server restarted at around 2015-07-16 12:30:20.
>>>> 5. Host ID 3 connected back at around 2015-07-16 12:44:08.
>>>> 6. Management server identified the missing VMs and triggered HA on
>>>> those.
>>>> 7. The VMs were eventually started, all 4 of them.
>>>>
>>>> I am not 100% sure why HA wasn't triggered until 2015-07-16 12:30
>>>> (#3), but I know that the management server restart caused it not to
>>>> happen until the host was reconnected.
>>>
>>> Perhaps the management server didn't recognize host 3 as totally down
>>> (ping still alive? or some quorum not OK?)
>>> The only way for the mgmt server to fully accept that host 3 has a
>>> real problem is that host 3 has been rebooted (around 12:44)?
>>>
>>> What is the storage subsystem? CLVMd?
>>>
>>>> Regards,
>>>> Somesh
>>>>
>>>> -----Original Message-----
>>>> From: Luciano Castro [mailto:luciano.castro@gmail.com]
>>>> Sent: Friday, July 17, 2015 12:13 PM
>>>> To: users@cloudstack.apache.org
>>>> Subject: Re: HA feature - KVM - CloudStack 4.5.1
>>>>
>>>> No problem, Somesh, thanks for your help.
>>>>
>>>> Link to the log:
>>>>
>>>> https://dl.dropboxusercontent.com/u/6774061/management-server.log.2015-07-16.gz
>>>>
>>>> Luciano
>>>>
>>>> On Fri, Jul 17, 2015 at 12:00 PM, Somesh Naidu wrote:
>>>>
>>>>> How large are the management server logs dated 2015-07-16? I would
>>>>> like to review them. All the information I need from that incident
>>>>> should be in there, so I don't need any more testing.
>>>>>
>>>>> Regards,
>>>>> Somesh
>>>>>
>>>>> -----Original Message-----
>>>>> From: Luciano Castro [mailto:luciano.castro@gmail.com]
>>>>> Sent: Friday, July 17, 2015 7:58 AM
>>>>> To: users@cloudstack.apache.org
>>>>> Subject: Re: HA feature - KVM - CloudStack 4.5.1
>>>>>
>>>>> Hi Somesh!
>>>>>
>>>>> [root@1q2 ~]# zgrep -i -E
>>>>> 'SimpleIvestigator|KVMInvestigator|PingInvestigator|ManagementIPSysVMInvestigator'
>>>>> /var/log/cloudstack/management/management-server.log.2015-07-16.gz
>>>>> | tail -5000 > /tmp/management.txt
>>>>> [root@1q2 ~]# cat /tmp/management.txt
>>>>> 2015-07-16 12:30:45,452 DEBUG [o.a.c.s.l.r.ExtensionRegistry] (main:null)
>>>>> Registering extension [KVMInvestigator] in [Ha Investigators Registry]
>>>>> 2015-07-16 12:30:45,452 DEBUG [o.a.c.s.l.r.RegistryLifecycle] (main:null)
>>>>> Registered com.cloud.ha.KVMInvestigator@57ceec9a
>>>>> 2015-07-16 12:30:45,927 DEBUG [o.a.c.s.l.r.ExtensionRegistry] (main:null)
>>>>> Registering extension [PingInvestigator] in [Ha Investigators Registry]
>>>>> 2015-07-16 12:30:45,928 DEBUG [o.a.c.s.l.r.ExtensionRegistry] (main:null)
>>>>> Registering extension [ManagementIPSysVMInvestigator] in [Ha Investigators
>>>>> Registry]
>>>>> 2015-07-16 12:30:53,796 INFO [o.a.c.s.l.r.DumpRegistry] (main:null)
>>>>> Registry [Ha Investigators Registry] contains [SimpleInvestigator,
>>>>> XenServerInvestigator, KVMInv
>>>>>
>>>>> I had searched this log before, but I thought there was nothing
>>>>> special in it.
>>>>>
>>>>> If you want to propose another test scenario to me, I can run it.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Thu, Jul 16, 2015 at 7:27 PM, Somesh Naidu wrote:
>>>>>
>>>>>> What about the other investigators, specifically "KVMInvestigator,
>>>>>> PingInvestigator"? Do they report the VMs as alive=false too?
>>>>>>
>>>>>> Also, it is recommended that you look at the management-server.log
>>>>>> instead of catalina.out (for one, the latter doesn't have
>>>>>> timestamps).
>>>>>>
>>>>>> Regards,
>>>>>> Somesh
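In practice that check is a one-liner; the log line format below is
taken from the excerpts in this thread (the path may differ on your
install):

# every investigator verdict, with timestamps:
grep -E 'Investigator found .*to be alive\?' \
    /var/log/cloudstack/management/management-server.log

# expected output looks like:
# 2015-07-16 ... INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:...)
#   SimpleInvestigator found VM[User|i-2-39-VM]to be alive? false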
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Luciano Castro [mailto:luciano.castro@gmail.com]
>>>>>> Sent: Thursday, July 16, 2015 1:14 PM
>>>>>> To: users@cloudstack.apache.org
>>>>>> Subject: Re: HA feature - KVM - CloudStack 4.5.1
>>>>>>
>>>>>> Hi Somesh!
>>>>>>
>>>>>> Thanks for the help. I did it again and collected new logs:
>>>>>>
>>>>>> My vm_instance name is i-2-39-VM. There were some routers on KVM
>>>>>> host 'A' (the one that I powered off now):
>>>>>>
>>>>>> [root@1q2 ~]# grep -i -E 'SimpleInvestigator.*false'
>>>>>> /var/log/cloudstack/management/catalina.out
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-e2f91c9c work-3)
>>>>>> SimpleInvestigator found VM[DomainRouter|r-4-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-729acf4f work-7)
>>>>>> SimpleInvestigator found VM[User|i-23-33-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-a66a4941 work-8)
>>>>>> SimpleInvestigator found VM[DomainRouter|r-36-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-5977245e work-10)
>>>>>> SimpleInvestigator found VM[User|i-17-26-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-c7f39be0 work-9)
>>>>>> SimpleInvestigator found VM[DomainRouter|r-32-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-3:ctx-ad4f5fda work-10)
>>>>>> SimpleInvestigator found VM[DomainRouter|r-46-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-0:ctx-0257f5af work-11)
>>>>>> SimpleInvestigator found VM[User|i-4-52-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-7ddff382 work-12)
>>>>>> SimpleInvestigator found VM[DomainRouter|r-32-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-9f79917e work-13)
>>>>>> SimpleInvestigator found VM[User|i-2-39-VM]to be alive? false
>>>>>>
>>>>>> KVM host 'B' agent log (where the machine should be migrated to):
>>>>>>
>>>>>> 2015-07-16 16:58:56,537 INFO [kvm.resource.LibvirtComputingResource]
>>>>>> (agentRequest-Handler-4:null) Live migration of instance i-2-39-VM
>>>>>> initiated
>>>>>> 2015-07-16 16:58:57,540 INFO [kvm.resource.LibvirtComputingResource]
>>>>>> (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to
>>>>>> complete, waited 1000ms
>>>>>> 2015-07-16 16:58:58,541 INFO [kvm.resource.LibvirtComputingResource]
>>>>>> (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to
>>>>>> complete, waited 2000ms
>>>>>> 2015-07-16 16:58:59,542 INFO [kvm.resource.LibvirtComputingResource]
>>>>>> (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to
>>>>>> complete, waited 3000ms
>>>>>> 2015-07-16 16:59:00,543 INFO [kvm.resource.LibvirtComputingResource]
>>>>>> (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to
>>>>>> complete, waited 4000ms
>>>>>> 2015-07-16 16:59:01,245 INFO [kvm.resource.LibvirtComputingResource]
>>>>>> (agentRequest-Handler-4:null) Migration thread for i-2-39-VM is done
>>>>>>
>>>>>> It says "done" for my i-2-39-VM instance, but I can't ping this
>>>>>> host.
>>>>>>
>>>>>> Luciano
>>>>>
>>>>> --
>>>>> Luciano Castro
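As a final check on whether i-2-39-VM really came up on host 'B' after
that "Migration ... done" message, something along these lines helps (a
sketch; the cloudmonkey filter fields are from memory, verify them):

# on KVM host 'B', ask libvirt directly:
virsh list --all | grep i-2-39-VM

# from the management server, ask CloudStack:
cloudmonkey list virtualmachines keyword=i-2-39 filter=name,state,hostname

If the VM shows as Running in both places but still doesn't answer
pings, the problem is more likely in the guest networking than in HA
itself.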