cloudstack-issues mailing list archives

From "Prachi Damle (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLOUDSTACK-4620) Vm failed to start on the host on which it was running due to not having enough reservedMem when the host was powered on after being shutdown.
Date Thu, 12 Dec 2013 00:44:07 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845926#comment-13845926 ]

Prachi Damle commented on CLOUDSTACK-4620:
------------------------------------------

Root cause analysis:
---------------------------
Scenario Observed:
------------------------
The VM 'tempsnap' fails to find any reserved capacity on the host:

2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) STATS: Failed to alloc resource from host: 1 reservedCpu: 1500, requested cpu: 500, reservedMem: 0, requested mem: 536870912
2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) Host does not have enough reserved RAM available, cannot allocate to this host.

Even if there is no reserved capacity left, we check whether the host has any free capacity to start the VM.
While doing so we find that the host has crossed the CPU threshold limit, so no more VMs can be allocated to it. Hence starting this VM errors out. Logs:

2013-09-05 12:52:44,943 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) Cannot allocate cluster list [1] for vm creation since their allocated percentage crosses the disable capacity threshold defined at each cluster/ at global value for capacity Type : 1, skipping these clusters

However, after some time the same VM starts fine - this is because the CapacityChecker thread runs in the meantime and corrects the host's capacity numbers.

Problem: 
-------------
Why does the host cross the CPU threshold limit if no new VMs are being deployed?


When the host is shut down, the SSVM and CPVM keep trying to restart over and over.

In the case of the SSVM, a new SSVM is created on each attempt and allocated from the host's available free capacity - when it fails to start, the SSVM is destroyed and the allocated capacity is freed. So the SSVM does not cause any capacity bug.

But in the case of the CPVM, the same CPVM entry is reused - so CloudStack tries to start the CPVM on its last host using the reserved capacity. However, when the start fails, that capacity is not added back to the reserved quota. Thus each retry subtracts the CPVM's requirement from the reserved quota, but never adds it back on failure.
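The leak described above can be sketched as follows. This is a minimal model, not CloudStack's actual CapacityManager code; the class and method names are hypothetical:

```python
# Hypothetical model of the reserved-quota leak: capacity is debited from
# the reserved pool before the start attempt, but never credited back when
# the hypervisor start fails (e.g. because the host is powered off).
class HostCapacity:
    def __init__(self, reserved_mem):
        self.reserved_mem = reserved_mem

    def hypervisor_start(self):
        return False  # host is down, every start attempt fails

    def try_start_from_reserved(self, requested_mem):
        if self.reserved_mem < requested_mem:
            return False
        # Capacity is moved out of the reserved quota up front ...
        self.reserved_mem -= requested_mem
        started = self.hypervisor_start()
        if not started:
            # BUG: reserved_mem is not restored here, so every retry
            # leaks requested_mem out of the reserved pool.
            pass
        return started

cpvm_mem = 536870912                       # 512 MB, as in the log above
host = HostCapacity(reserved_mem=2 * cpvm_mem)
for _ in range(5):                         # CPVM keeps retrying
    host.try_start_from_reserved(cpvm_mem)
print(host.reserved_mem)                   # drained to 0 after two retries
```

Restoring the quota in the failure branch (or recomputing it, as the CapacityChecker eventually does) would prevent the drain.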

Now when CloudStack detects that the host is down, all user VMs enter the 'Stopped' state and each VM's capacity is moved into the reserved quota. The CPVM retries keep reducing this quota - and since the CPVM's RAM requirement is higher than a user VM's, the reserved RAM reaches zero faster than the reserved CPU.

When the host comes back, all user VMs try to start again - first they try to use the reserved capacity, but since the reserved RAM is zero, they consume free capacity instead. Thus the user VMs keep increasing the host's 'used' CPU value without reducing the 'reserved' CPU (which was reserved when they got Stopped).

So at some point the (used + reserved) CPU crosses the threshold limit, causing failures in starting any more user VMs.
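The arithmetic can be illustrated with the numbers from the log above (total CPU 9040, 1500 of reserved CPU stuck, each VM requesting 500); the 0.85 disable threshold is an assumed value, not taken from this environment:

```python
# Rough arithmetic of the threshold crossing: 'used' grows with each VM
# started from free capacity, while the stuck 'reserved' never shrinks.
total_cpu = 9040        # host's total CPU from the log
reserved_cpu = 1500     # reserved CPU that is never released
threshold = 0.85        # assumed disable threshold (hypothetical value)
per_vm_cpu = 500        # each user VM's request, as in the log

used_cpu = 0
started = 0
while (used_cpu + reserved_cpu + per_vm_cpu) <= total_cpu * threshold:
    used_cpu += per_vm_cpu
    started += 1
print(started)  # only 12 such VMs fit before allocation is disabled
```

With a correctly released reserve, roughly 1500 / 500 = 3 more VMs would have fit before hitting the same threshold.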

Why is this not a big issue?
-------------
The above situation gets corrected when the CapacityChecker thread runs and fixes the host's reserved capacity; this thread runs every 5 minutes.

Thus on the next try the user VM starts fine, because the thread has already resolved the (used + reserved) > threshold situation.
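Conceptually, the periodic checker rebuilds the used and reserved figures from the actual VM states, discarding whatever the leaked decrements left behind. A hedged sketch, with illustrative names that do not mirror CloudStack's real code:

```python
# Hypothetical recalculation, as the periodic capacity checker does
# conceptually: reserved capacity is derived from Stopped VMs, used
# capacity from Running ones, so stale leaked values are overwritten.
def recalculate(vm_states, per_vm_mem):
    used = sum(per_vm_mem for state in vm_states if state == "Running")
    reserved = sum(per_vm_mem for state in vm_states if state == "Stopped")
    return used, reserved

# Three VMs still stopped after the outage, one already restarted:
used, reserved = recalculate(
    ["Running", "Stopped", "Stopped", "Stopped"], 536870912)
print(reserved)  # reserved quota rebuilt from Stopped VMs; retries now succeed
```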

> Vm failed to start on the host on which it was running due to not having enough reservedMem when the host was powered on after being shutdown.
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-4620
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-4620
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Management Server
>    Affects Versions: 4.2.1
>         Environment: Build from 4.2-forward
>            Reporter: Sangeetha Hariharan
>            Assignee: Prachi Damle
>             Fix For: 4.3.0
>
>         Attachments: hostdown.rar
>
>
> Vm failed to start on the host on which it was running due to not having enough reservedMem when the host was powered on after being shutdown
> Steps to reproduce the problem:
> Advanced zone with 1 cluster having 1 host (XenServer).
> Had SSVM, CPVM, 2 routers and a few user Vms running on the host.
> Power down the host.
> After a few hours, powered on the host.
> All the Vms running on this host were marked "Stopped".
> Tried to start all the user Vms running on this host.
> 1 of the user Vms fails to start because of not having enough "Reserved RAM"
> 2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) Reserved RAM: 0 , Requested RAM: 536870912
> When I tried to start the same Vm again after a few minutes, it started successfully on the same host.
> Seems like there is some issue with releasing the capacity when all the Vms get marked as "Stopped" by the VM sync process.
> The Vm that failed to start because of capacity and then eventually succeeded when starting after a few minutes is "temfromsnap".
> Management server logs when starting the VM fails to start on the last_host_id:
> 2013-09-05 12:52:44,934 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) DeploymentPlanner allocation algorithm: com.cloud.deploy.FirstFitPlanner_EnhancerByCloudStack_b297c61b@7e43d432
> 2013-09-05 12:52:44,934 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) Trying to allocate a host and storage pools from dc:1, pod:1,cluster:1, requested cpu: 500, requested ram: 536870912
> 2013-09-05 12:52:44,934 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) Is ROOT volume READY (pool already allocated)?: Yes
> 2013-09-05 12:52:44,934 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) This VM has last host_id specified, trying to choose the same host: 1
> 2013-09-05 12:52:44,938 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) Checking if host: 1 has enough capacity for requested CPU: 500 and requested RAM: 536870912 , cpuOverprovisioningFactor: 1.0
> 2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) Hosts's actual total CPU: 9040 and CPU after applying overprovisioning: 9040
> 2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) We need to allocate to the last host again, so checking if there is enough reserved capacity
> 2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) Reserved CPU: 1500 , Requested CPU: 500
> 2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) Reserved RAM: 0 , Requested RAM: 536870912
> 2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) STATS: Failed to alloc resource from host: 1 reservedCpu: 1500, requested cpu: 500, reservedMem: 0, requested mem: 536870912
> 2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) Host does not have enough reserved RAM available, cannot allocate to this host.
> 2013-09-05 12:52:44,940 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) The last host of this VM does not have enough capacity
> 2013-09-05 12:52:44,940 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) Cannot choose the last host to deploy this VM
> 2013-09-05 12:52:44,940 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) Searching resources only under specified Cluster: 1
> 2013-09-05 12:52:44,943 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) Cannot allocate cluster list [1] for vm creation since their allocated percentage crosses the disable capacity threshold defined at each cluster/ at global value for capacity Type : 1, skipping these clusters
> 2013-09-05 12:52:44,948 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3cbed ]) Deploy avoids pods: [], clusters: [1], hosts: []



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
