From: Milamber <milamber@apache.org>
Organization: Apache Software Foundation
Date: Mon, 20 Jul 2015 16:16:36 +0100
To: users@cloudstack.apache.org
Subject: Re: HA feature - KVM - CloudStack 4.5.1
Message-ID: <55AD10D4.1090908@apache.org>

On 20/07/2015 15:44, Luciano Castro wrote:
> Hi!
>
> My test today: I stopped another instance and changed it to an HA
> offering, then started the instance.
>
> Afterwards, I gracefully shut down its KVM host.

Why a graceful shutdown of the KVM host? The HA process (re)starts the
HA-enabled VMs on a new host when the current host has crashed or is
otherwise unavailable, i.e. its CloudStack agent no longer responds. If
you stop the cloudstack-agent gently, the CS management server does not
consider that a crash, so HA will not start. What behavior do you
expect?
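If you want to see HA actually trigger, make the failure look real to
the management server: kill the host or its agent connectivity without
any clean shutdown. A rough sketch (lab host only; the sysrq trick and
the default agent port 8250 are my assumptions, verify them against
your setup):

# Option 1: hard-crash the KVM host so nothing shuts down cleanly
# (requires kernel.sysrq to be enabled):
echo c > /proc/sysrq-trigger

# Option 2: silently drop agent <-> management server traffic
# (8250 is the default port; check the "port" global setting):
iptables -A OUTPUT -p tcp --dport 8250 -j DROP

# Then watch the management server investigate the host:
tail -f /var/log/cloudstack/management/management-server.log \
    | grep -i -E 'investigator|HA-Worker'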
> and I checked the investigator processes:
>
> [root@1q2 ~]# grep -i Investigator /var/log/cloudstack/management/management-server.log
>
> [root@1q2 ~]# date
> Mon Jul 20 14:39:43 UTC 2015
>
> [root@1q2 ~]# ls -ltrh /var/log/cloudstack/management/management-server.log
> -rw-rw-r--. 1 cloud cloud 14M Jul 20 14:39
> /var/log/cloudstack/management/management-server.log
>
> Nothing. I don't know how these processes work internally, but it
> seems they are not working well, agree?
>
> option                       value
> ha.investigators.exclude     (empty)
> ha.investigators.order       SimpleInvestigator,XenServerInvestigator,KVMInvestigator,HypervInvestigator,VMwareInvestigator,PingInvestigator,ManagementIPSysVMInvestigator
> investigate.retry.interval   60
>
> Is there a way to check whether these processes are running?
>
> [root@1q2 ~]# ps waux | grep -i java
> root     11408  0.0  0.0 103252    880 pts/0 S+ 14:44 0:00 grep -i java
> cloud    24225  0.7  1.7 16982036 876412 ?   Sl  Jul16 43:48
> /usr/lib/jvm/jre-1.7.0/bin/java -Djava.awt.headless=true
> -Dcom.sun.management.jmxremote=false -Xmx2g -XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/var/log/cloudstack/management/ -XX:PermSize=512M
> -XX:MaxPermSize=800m
> -Djava.security.properties=/etc/cloudstack/management/java.security.ciphers
> -classpath
> :::/etc/cloudstack/management:/usr/share/cloudstack-management/setup:/usr/share/cloudstack-management/bin/bootstrap.jar:/usr/share/cloudstack-management/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar
> -Dcatalina.base=/usr/share/cloudstack-management
> -Dcatalina.home=/usr/share/cloudstack-management -Djava.endorsed.dirs=
> -Djava.io.tmpdir=/usr/share/cloudstack-management/temp
> -Djava.util.logging.config.file=/usr/share/cloudstack-management/conf/logging.properties
> -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
> org.apache.catalina.startup.Bootstrap start
>
> Thanks
>
> On Sat, Jul 18, 2015 at 1:53 PM, Milamber <milamber@apache.org> wrote:
>
>> On 17/07/2015 22:26, Somesh Naidu wrote:
>>
>>>> Perhaps the management server didn't recognize host 3 as totally
>>>> down (ping still alive? or some quorum not OK?)
>>>> The only way for the mgmt server to fully accept that host 3 has a
>>>> real problem is that host 3 has been rebooted (around 12:44)?
>>>>
>>> The host disconnect was triggered at 12:19 on host 3. The mgmt server
>>> was pretty sure the host was down (it was a graceful shutdown, I
>>> believe), which is why it triggered a disconnect and notified other
>>> nodes. There was no checkhealth/checkonhost/etc. triggered; just the
>>> agent disconnect, with all listeners (ping/etc.) notified.
>>>
>>> At that point the mgmt server should have scheduled HA on all VMs
>>> running on that host. The HA investigators would then work their way
>>> through, identifying whether the VMs are still running, whether they
>>> need to be fenced, etc. But this never happened.
>>
>> AFAIK, stopping the cloudstack-agent service does not start the HA
>> process for the VMs hosted on that node. It seems normal to me that
>> the HA process does not start at that moment.
>> If I wanted to start the HA process on a node, I would go to the Web
>> UI (or cloudmonkey) and change the state of the host from Up to
>> Maintenance.
>>
>> (Afterwards I can stop the cloudstack-agent service if I need to, for
>> example to reboot the node.)
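For reference, with cloudmonkey that looks roughly like this (a sketch;
fill in your own host UUID, and verify the command names against your
cloudmonkey version):

# find the host, then put it into maintenance:
cloudmonkey list hosts type=Routing filter=id,name,state
cloudmonkey prepare hostformaintenance id=<host-uuid>
# CloudStack then live-migrates the VMs off the host; afterwards:
cloudmonkey cancel hostmaintenance id=<host-uuid>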
>>> Regards,
>>> Somesh
>>>
>>> -----Original Message-----
>>> From: Milamber [mailto:milamber@apache.org]
>>> Sent: Friday, July 17, 2015 6:01 PM
>>> To: users@cloudstack.apache.org
>>> Subject: Re: HA feature - KVM - CloudStack 4.5.1
>>>
>>> On 17/07/2015 21:23, Somesh Naidu wrote:
>>>
>>>> Ok, so here are my findings.
>>>>
>>>> 1. Host ID 3 was shut down around 2015-07-16 12:19:09, at which
>>>> point the management server called a disconnect.
>>>> 2. Based on the logs, it seems VM IDs 32, 18, 39 and 46 were running
>>>> on the host.
>>>> 3. No HA tasks for any of these VMs at this time.
>>>> 4. Management server restarted at around 2015-07-16 12:30:20.
>>>> 5. Host ID 3 connected back at around 2015-07-16 12:44:08.
>>>> 6. Management server identified the missing VMs and triggered HA on
>>>> those.
>>>> 7. The VMs were eventually started, all 4 of them.
>>>>
>>>> I am not 100% sure why HA wasn't triggered until 2015-07-16 12:30
>>>> (#3), but I know that the management server restart caused it not to
>>>> happen until the host was reconnected.
>>>
>>> Perhaps the management server didn't recognize host 3 as totally down
>>> (ping still alive? or some quorum not OK?)
>>> The only way for the mgmt server to fully accept that host 3 has a
>>> real problem is that host 3 has been rebooted (around 12:44)?
>>>
>>> What is the storage subsystem? CLVMd?
>>>
>>>> Regards,
>>>> Somesh
>>>>
>>>> -----Original Message-----
>>>> From: Luciano Castro [mailto:luciano.castro@gmail.com]
>>>> Sent: Friday, July 17, 2015 12:13 PM
>>>> To: users@cloudstack.apache.org
>>>> Subject: Re: HA feature - KVM - CloudStack 4.5.1
>>>>
>>>> No problem, Somesh, thanks for your help.
>>>>
>>>> Link to the log:
>>>>
>>>> https://dl.dropboxusercontent.com/u/6774061/management-server.log.2015-07-16.gz
>>>>
>>>> Luciano
>>>>
>>>> On Fri, Jul 17, 2015 at 12:00 PM, Somesh Naidu wrote:
>>>>
>>>>> How large are the management server logs dated 2015-07-16? I would
>>>>> like to review them. All the information I need from that incident
>>>>> should be in there, so I don't need any more testing.
>>>>>
>>>>> Regards,
>>>>> Somesh
>>>>>
>>>>> -----Original Message-----
>>>>> From: Luciano Castro [mailto:luciano.castro@gmail.com]
>>>>> Sent: Friday, July 17, 2015 7:58 AM
>>>>> To: users@cloudstack.apache.org
>>>>> Subject: Re: HA feature - KVM - CloudStack 4.5.1
>>>>>
>>>>> Hi Somesh!
>>>>>
>>>>> [root@1q2 ~]# zgrep -i -E
>>>>> 'SimpleIvestigator|KVMInvestigator|PingInvestigator|ManagementIPSysVMInvestigator'
>>>>> /var/log/cloudstack/management/management-server.log.2015-07-16.gz
>>>>> | tail -5000 > /tmp/management.txt
>>>>> [root@1q2 ~]# cat /tmp/management.txt
>>>>> 2015-07-16 12:30:45,452 DEBUG [o.a.c.s.l.r.ExtensionRegistry] (main:null)
>>>>> Registering extension [KVMInvestigator] in [Ha Investigators Registry]
>>>>> 2015-07-16 12:30:45,452 DEBUG [o.a.c.s.l.r.RegistryLifecycle] (main:null)
>>>>> Registered com.cloud.ha.KVMInvestigator@57ceec9a
>>>>> 2015-07-16 12:30:45,927 DEBUG [o.a.c.s.l.r.ExtensionRegistry] (main:null)
>>>>> Registering extension [PingInvestigator] in [Ha Investigators Registry]
>>>>> 2015-07-16 12:30:45,928 DEBUG [o.a.c.s.l.r.ExtensionRegistry] (main:null)
>>>>> Registering extension [ManagementIPSysVMInvestigator] in [Ha Investigators
>>>>> Registry]
>>>>> 2015-07-16 12:30:53,796 INFO [o.a.c.s.l.r.DumpRegistry] (main:null)
>>>>> Registry [Ha Investigators Registry] contains [SimpleInvestigator,
>>>>> XenServerInvestigator, KVMInv
>>>>>
>>>>> I had searched this log before, but I thought there was nothing
>>>>> special in it.
>>>>>
>>>>> If you want to propose another test scenario to me, I can run it.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Thu, Jul 16, 2015 at 7:27 PM, Somesh Naidu wrote:
>>>>>
>>>>>> What about the other investigators, specifically "KVMInvestigator,
>>>>>> PingInvestigator"? Do they report the VMs as alive=false too?
>>>>>>
>>>>>> Also, it is recommended that you look at the management-server.log
>>>>>> instead of catalina.out (for one, the latter doesn't have
>>>>>> timestamps).
>>>>>>
>>>>>> Regards,
>>>>>> Somesh
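In practice that check is a one-liner; the log line format below is
taken from the excerpts in this thread (the path may differ on your
install):

# every investigator verdict, with timestamps:
grep -E 'Investigator found .*to be alive\?' \
    /var/log/cloudstack/management/management-server.log

# expected output looks like:
# 2015-07-16 ... INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:...)
#   SimpleInvestigator found VM[User|i-2-39-VM]to be alive? false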
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Luciano Castro [mailto:luciano.castro@gmail.com]
>>>>>> Sent: Thursday, July 16, 2015 1:14 PM
>>>>>> To: users@cloudstack.apache.org
>>>>>> Subject: Re: HA feature - KVM - CloudStack 4.5.1
>>>>>>
>>>>>> Hi Somesh!
>>>>>>
>>>>>> Thanks for the help. I did it again and collected new logs:
>>>>>>
>>>>>> My vm_instance name is i-2-39-VM. There were some routers on KVM
>>>>>> host 'A' (the one that I powered off now):
>>>>>>
>>>>>> [root@1q2 ~]# grep -i -E 'SimpleInvestigator.*false'
>>>>>> /var/log/cloudstack/management/catalina.out
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-e2f91c9c work-3)
>>>>>> SimpleInvestigator found VM[DomainRouter|r-4-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-729acf4f work-7)
>>>>>> SimpleInvestigator found VM[User|i-23-33-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-a66a4941 work-8)
>>>>>> SimpleInvestigator found VM[DomainRouter|r-36-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-5977245e work-10)
>>>>>> SimpleInvestigator found VM[User|i-17-26-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-c7f39be0 work-9)
>>>>>> SimpleInvestigator found VM[DomainRouter|r-32-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-3:ctx-ad4f5fda work-10)
>>>>>> SimpleInvestigator found VM[DomainRouter|r-46-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-0:ctx-0257f5af work-11)
>>>>>> SimpleInvestigator found VM[User|i-4-52-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-7ddff382 work-12)
>>>>>> SimpleInvestigator found VM[DomainRouter|r-32-VM]to be alive? false
>>>>>> INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-9f79917e work-13)
>>>>>> SimpleInvestigator found VM[User|i-2-39-VM]to be alive? false
>>>>>>
>>>>>> KVM host 'B' agent log (where the machine should be migrated to):
>>>>>>
>>>>>> 2015-07-16 16:58:56,537 INFO [kvm.resource.LibvirtComputingResource]
>>>>>> (agentRequest-Handler-4:null) Live migration of instance i-2-39-VM
>>>>>> initiated
>>>>>> 2015-07-16 16:58:57,540 INFO [kvm.resource.LibvirtComputingResource]
>>>>>> (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to
>>>>>> complete, waited 1000ms
>>>>>> 2015-07-16 16:58:58,541 INFO [kvm.resource.LibvirtComputingResource]
>>>>>> (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to
>>>>>> complete, waited 2000ms
>>>>>> 2015-07-16 16:58:59,542 INFO [kvm.resource.LibvirtComputingResource]
>>>>>> (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to
>>>>>> complete, waited 3000ms
>>>>>> 2015-07-16 16:59:00,543 INFO [kvm.resource.LibvirtComputingResource]
>>>>>> (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to
>>>>>> complete, waited 4000ms
>>>>>> 2015-07-16 16:59:01,245 INFO [kvm.resource.LibvirtComputingResource]
>>>>>> (agentRequest-Handler-4:null) Migration thread for i-2-39-VM is done
>>>>>>
>>>>>> It says "done" for my i-2-39-VM instance, but I can't ping this
>>>>>> host.
>>>>>>
>>>>>> Luciano
>>>>>
>>>>> --
>>>>> Luciano Castro
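As a final check on whether i-2-39-VM really came up on host 'B' after
that "Migration ... done" message, something along these lines helps (a
sketch; the cloudmonkey filter fields are from memory, verify them):

# on KVM host 'B', ask libvirt directly:
virsh list --all | grep i-2-39-VM

# from the management server, ask CloudStack:
cloudmonkey list virtualmachines keyword=i-2-39 filter=name,state,hostname

If the VM shows as Running in both places but still doesn't answer
pings, the problem is more likely in the guest networking than in HA
itself.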