cloudstack-dev mailing list archives

From Sean Lair <sl...@ippathways.com>
Subject RE: Found 2 bugs in ACS 4.10. Possibly exist in 4.11 (master)
Date Sun, 07 Jan 2018 18:16:59 GMT
I use a wildcard cert on 4.9.2 and it's fine. We haven't gone to 4.10 yet to test.  We'll probably
go straight to 4.11 when it's released.

We have also had the high CPU on the management servers in our 4.9.x deployments.  It is very frustrating,
and it happens every few days.  We haven't been able to track down why yet.  In a different
thread a while ago, Simon Weller also reported high-CPU issues, but I'm not sure if he ever
found the culprit either.
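A standard way to narrow down this kind of JVM high-CPU problem (a general Java diagnosis technique, not something specific to CloudStack): run `top -H -p <mgmt-server-pid>` to find the busiest native thread IDs, convert them to hex, and look them up against the `nid=0x...` field in a `jstack` thread dump. A minimal sketch of the matching step, using a made-up jstack line:

```python
# Sketch: match a busy native thread ID (decimal, as shown by `top -H`)
# against the hex `nid` field printed in a jstack thread dump, so the
# hot thread's stack trace can be located. The sample line is invented.

def tid_to_nid(tid):
    """top shows decimal thread IDs; jstack prints them as hex nid."""
    return hex(tid)

def find_thread(jstack_output, tid):
    """Return the header line of the jstack thread matching the TID."""
    nid = tid_to_nid(tid)
    for line in jstack_output.splitlines():
        if f"nid={nid} " in line or line.rstrip().endswith(f"nid={nid}"):
            return line.strip()
    return None

sample = '"Work-Job-Executor-5" #42 daemon prio=5 tid=0x00007f3c nid=0x2b67 runnable'
print(find_thread(sample, 11111))  # 11111 decimal == 0x2b67
```

Repeating this a few times while the CPU is pegged usually shows whether it's the same worker pool (e.g. the Work-Job-Executor threads from the logs below) spinning every time.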



-----Original Message-----
From: Ivan Kudryavtsev [mailto:kudryavtsev_ia@bw-sw.com] 
Sent: Saturday, January 6, 2018 12:28 AM
To: dev <dev@cloudstack.apache.org>
Subject: Found 2 bugs in ACS 4.10. Possibly exist in 4.11 (master)

Hello, colleagues.

Over the last few days I found 2 bugs which I believe are critical for the 4.11 release. I would like
to share them here and get help if possible:

1. CPVM bug. I use a wildcard certificate issued by the Comodo CA. I uploaded it to CloudStack via the UI and
destroyed the CPVM to force it to pick up the new certificate. It works like a charm at first, but after some amount of time the CPVM
loses it and a console proxy connection is no longer possible. After it is rebooted
or recreated, everything works well again. I'm not familiar with the CPVM internals at all and cannot even imagine
what could be wrong here.

1a. CPVM has debug enabled and logs include tons of messages like:

2018-01-06 06:13:57,069 DEBUG
[cloud.consoleproxy.ConsoleProxyAjaxImageHandler] (Thread-4159:null) AjaxImageHandler
/ajaximg?token=RcHSrvzegyrjZAlc1Wjifcwv9P8WwK3eH63SuIS8WFFGssxymmjdYkZ4-S4ilY1UHxX612Lt_5Xi1Z5JaoCfDSf_UCi8lTIsPEBlDpUEWQg1IblYu0HxvoDugX9J4XgAdpj74qg_U4pOs74dzdZFB50PB_HxcMhzUqd5plH914PmRDw5k0ONaa183CsGa7DcGVvWaR_eYP_8_CArahGAjHt04Kx227tjyMx4Zaju7iNyxpBWxtBC5YJyj8rjv7IeA_0Pevz91pWn6OE1pkeLwGeFSV8pZw4BWg95SG97A-I&key=2020&ts=1515219237015
2018-01-06 06:13:57,070 DEBUG
[cloud.consoleproxy.ConsoleProxyHttpHandlerHelper] (Thread-4159:null) decode token. host:
10.252.2.10
2018-01-06 06:13:57,070 DEBUG
[cloud.consoleproxy.ConsoleProxyHttpHandlerHelper] (Thread-4159:null) decode token. port:
5903
2018-01-06 06:13:57,070 DEBUG
[cloud.consoleproxy.ConsoleProxyHttpHandlerHelper] (Thread-4159:null) decode token. tag: 375c62b5-74d9-4494-8b79-0d7c76cff10f

Every opened session is dumped to the logs. I suppose this is dangerous and could lead to the log filesystem filling up
and CPVM failure.

/dev/vda10                                              368M   63M  287M
19% /var/log

Might it be that (1) is a consequence of (1a)?
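To gauge how soon that DEBUG traffic could actually fill /var/log, a rough back-of-the-envelope calculation from the df output above. The daily growth rate here is a hypothetical figure for illustration, not a measured value:

```python
# Rough estimate of days until /var/log fills, using the df output above:
# 368M total, 63M used, 287M available on /dev/vda10.
# The growth rate per day is an assumption, not a measured value.

def days_until_full(avail_mb, growth_mb_per_day):
    return avail_mb / growth_mb_per_day

avail = 287            # MB available (from the df output above)
assumed_growth = 20    # MB/day of DEBUG logging -- assumption
print(f"~{days_until_full(avail, assumed_growth):.1f} days until /var/log is full")
```

If the real growth rate is anywhere near that, the partition would fill within a couple of weeks unless logrotate keeps up, which would fit the "works after reboot, fails later" pattern in (1).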

2. High CPU utilization bug. Right after the management server is launched it uses almost no CPU, because this is
a development cloud. After two days I see that 2 cores are 50% used by management server processes;
several days ago I saw the management server processes utilize almost all available CPU.
Surprisingly, it continues to function (API, UI), with no heavy API activity visible in the logs.

The only two suspicious things I found for the last incident are:

root@cs2-head1:/var/log/cloudstack/management# zgrep ERROR management-server.log.2018-01-04.gz
2018-01-04 12:58:23,391 ERROR [c.c.c.ClusterManagerImpl]
(localhost-startStop-1:null) (logid:) Unable to ping management server at
10.252.2.2:9090 due to ConnectException
2018-01-04 12:58:25,743 ERROR [c.c.u.PropertiesUtil]
(localhost-startStop-1:null) (logid:) Unable to find properties file:
commands.properties
2018-01-04 14:36:23,874 ERROR [c.c.u.PropertiesUtil]
(localhost-startStop-1:null) (logid:) Unable to find properties file:
commands.properties
2018-01-04 14:43:23,043 ERROR [c.c.v.VmWorkJobHandlerProxy]
(Work-Job-Executor-5:ctx-e566f561 job-38158/job-38188 ctx-b1887051)
(logid:be4b64e0) Invocation exception, caused by:
com.cloud.exception.InsufficientServerCapacityException: Unable to create a deployment for
VM[SecondaryStorageVm|s-24-VM]Scope=interface
com.cloud.dc.DataCenter; id=1
2018-01-04 14:43:23,043 ERROR [c.c.v.VmWorkJobHandlerProxy]
(Work-Job-Executor-4:ctx-faf69614 job-38155/job-38185 ctx-83290fa8)
(logid:65010252) Invocation exception, caused by:
com.cloud.exception.InsufficientServerCapacityException: Unable to create a deployment for
VM[ConsoleProxy|v-10-VM]Scope=interface
com.cloud.dc.DataCenter; id=1
2018-01-04 14:43:23,044 ERROR [c.c.v.VmWorkJobDispatcher]
(Work-Job-Executor-5:ctx-e566f561 job-38158/job-38188) (logid:be4b64e0) Unable to complete
AsyncJobVO {id:38188, userId: 1, accountId: 1,
instanceType: null, instanceId: null, cmd: com.cloud.vm.VmWorkStart,
cmdInfo:
rO0ABXNyABhjb20uY2xvdWQudm0uVm1Xb3JrU3RhcnR9cMGsvxz73gIAC0oABGRjSWRMAAZhdm9pZHN0ADBMY29tL2Nsb3VkL2RlcGxveS9EZXBsb3ltZW50UGxhbm5lciRFeGNsdWRlTGlzdDtMAAljbHVzdGVySWR0ABBMamF2YS9sYW5nL0xvbmc7TAAGaG9zdElkcQB-AAJMAAtqb3VybmFsTmFtZXQAEkxqYXZhL2xhbmcvU3RyaW5nO0wAEXBoeXNpY2FsTmV0d29ya0lkcQB-AAJMAAdwbGFubmVycQB-AANMAAVwb2RJZHEAfgACTAAGcG9vbElkcQB-AAJMAAlyYXdQYXJhbXN0AA9MamF2YS91dGlsL01hcDtMAA1yZXNlcnZhdGlvbklkcQB-AAN4cgATY29tLmNsb3VkLnZtLlZtV29ya5-ZtlbwJWdrAgAESgAJYWNjb3VudElkSgAGdXNlcklkSgAEdm1JZEwAC2hhbmRsZXJOYW1lcQB-AAN4cAAAAAAAAAABAAAAAAAAAAEAAAAAAAAAGHQAGVZpcnR1YWxNYWNoaW5lTWFuYWdlckltcGwAAAAAAAAAAHBwcHBwcHBwcHA,
cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0,
result: null, initMsid: 2485138019287, completeMsid: null, lastUpdated:
null, lastPolled: null, created: Thu Jan 04 14:43:22 KRAT 2018}, job
origin:38158
2018-01-04 14:43:23,044 ERROR [c.c.v.VmWorkJobDispatcher]
(Work-Job-Executor-4:ctx-faf69614 job-38155/job-38185) (logid:65010252) Unable to complete
AsyncJobVO {id:38185, userId: 1, accountId: 1,
instanceType: null, instanceId: null, cmd: com.cloud.vm.VmWorkStart,
cmdInfo:
rO0ABXNyABhjb20uY2xvdWQudm0uVm1Xb3JrU3RhcnR9cMGsvxz73gIAC0oABGRjSWRMAAZhdm9pZHN0ADBMY29tL2Nsb3VkL2RlcGxveS9EZXBsb3ltZW50UGxhbm5lciRFeGNsdWRlTGlzdDtMAAljbHVzdGVySWR0ABBMamF2YS9sYW5nL0xvbmc7TAAGaG9zdElkcQB-AAJMAAtqb3VybmFsTmFtZXQAEkxqYXZhL2xhbmcvU3RyaW5nO0wAEXBoeXNpY2FsTmV0d29ya0lkcQB-AAJMAAdwbGFubmVycQB-AANMAAVwb2RJZHEAfgACTAAGcG9vbElkcQB-AAJMAAlyYXdQYXJhbXN0AA9MamF2YS91dGlsL01hcDtMAA1yZXNlcnZhdGlvbklkcQB-AAN4cgATY29tLmNsb3VkLnZtLlZtV29ya5-ZtlbwJWdrAgAESgAJYWNjb3VudElkSgAGdXNlcklkSgAEdm1JZEwAC2hhbmRsZXJOYW1lcQB-AAN4cAAAAAAAAAABAAAAAAAAAAEAAAAAAAAACnQAGVZpcnR1YWxNYWNoaW5lTWFuYWdlckltcGwAAAAAAAAAAHBwcHBwcHBwcHA,
cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0,
result: null, initMsid: 2485138019287, completeMsid: null, lastUpdated:
null, lastPolled: null, created: Thu Jan 04 14:43:21 KRAT 2018}, job
origin:38155
2018-01-04 14:43:25,127 ERROR
[o.a.c.s.d.d.CloudStackPrimaryDataStoreDriverImpl]
(consoleproxy-1:ctx-6f2f9b7b) (logid:25acd369) No remote endpoint to send DeleteCommand, check
if host or ssvm is down?
2018-01-04 14:43:25,127 ERROR
[o.a.c.s.d.d.CloudStackPrimaryDataStoreDriverImpl]
(secstorage-1:ctx-ae3adf87) (logid:4db1e2a0) No remote endpoint to send DeleteCommand, check
if host or ssvm is down?


The worst thing is that I don't even have an idea how to catch it. Also, since I have the second
management server down, I see a lot of events like:
2017-12-28 05:08:34,927 DEBUG [c.c.c.ClusterManagerImpl]
(Cluster-Heartbeat-1:ctx-9bbae21c) (logid:9a1b0b21) Management server heartbeat takes too
long to finish. profiler: Done. Duration: 1935ms,
profilerHeartbeatUpdate: Done. Duration: 617ms, profilerPeerScan: Done.
Duration: 1317ms

Could this be a cause of the high CPU utilization?
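One way to check whether the heartbeat slowness correlates with the CPU spikes is to pull the durations out of the log and compare them against CPU graphs over time. A quick sketch; the log line format is taken from the excerpt above:

```python
import re

# Sketch: extract (timestamp, duration_ms) from management-server.log
# lines like the heartbeat warning above, so slow heartbeats can be
# correlated with CPU utilization over time.

HEARTBEAT_RE = re.compile(r"profiler: Done\. Duration: (\d+)ms")

def heartbeat_durations(lines):
    """Yield (timestamp, duration_ms) for slow-heartbeat log lines."""
    for line in lines:
        m = HEARTBEAT_RE.search(line)
        if m:
            ts = line.split(" DEBUG")[0]
            yield ts, int(m.group(1))

sample = ("2017-12-28 05:08:34,927 DEBUG [c.c.c.ClusterManagerImpl] "
          "(Cluster-Heartbeat-1:ctx-9bbae21c) (logid:9a1b0b21) Management "
          "server heartbeat takes too long to finish. profiler: Done. "
          "Duration: 1935ms, profilerHeartbeatUpdate: Done. Duration: 617ms")
print(list(heartbeat_durations([sample])))
```

If the heartbeat durations climb in step with the CPU usage, that would at least point at the cluster management / peer-scan path rather than the API side.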

--
With best regards, Ivan Kudryavtsev
Bitworks Software, Ltd.
Cell: +7-923-414-1515
WWW: http://bitworks.software/