cloudstack-issues mailing list archives

From "Barys Dubauski (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CLOUDSTACK-10400) VPC Router Corruption when working with large number of networks containing instances with public IP addresses
Date Sat, 03 Nov 2018 02:26:00 GMT

     [ https://issues.apache.org/jira/browse/CLOUDSTACK-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Barys Dubauski updated CLOUDSTACK-10400:
----------------------------------------
    Description: 
We are using CloudStack 4.11.1 running with KVM hosts.  To simulate our use case, we created
a small program that calls the CloudStack API to

1) create a VPC network with 20 guest networks, each containing one virtual machine with a public
IP address allocated.  

2) delete the machines and networks one by one. 

 

However, we frequently get a timeout error, sometimes during VM deletion and sometimes
during guest network deletion or even during the static NAT disable step.  Once the timeout
occurs, the VPC network / virtual router appears to be in an *unstable/corrupted* state.  We
need to restart the virtual router with the clean option (sometimes we have to retry the restart
several times, as it fails to deploy the router VM as well).  After that, we can continue deleting
the remaining environments.  Here are the high-level steps that we performed:
 # Create VPC Network
 # For each of the 20 "environments"
 ## Create Guest Network
 ## Add a VM to the network
 ## Acquire Public IP
 ## Associate the Public IP with VM
 # For each of the 20 environments
 ## Disassociate the Public IP
 ## Delete VM
 ## Delete Guest network
 # Delete VPC
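
The steps above can be written out as an ordered list of API calls. This is only a sketch: the attached program's source is not shown here, so the mapping from each step to a CloudStack command name (for example, "Associate the Public IP with VM" to enableStaticNat) is an assumption, and the class name VpcLifecyclePlan and the "env-N" labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists, in order, the API calls of the create/teardown sequence. */
final class VpcLifecyclePlan {

    static List<String> build(int environments) {
        List<String> plan = new ArrayList<>();
        plan.add("createVPC");
        // Build phase: one guest network, one VM and one public IP per environment.
        for (int i = 1; i <= environments; i++) {
            plan.add("createNetwork env-" + i);
            plan.add("deployVirtualMachine env-" + i);
            plan.add("associateIpAddress env-" + i);
            plan.add("enableStaticNat env-" + i);      // assumed meaning of "associate IP with VM"
        }
        // Teardown phase: the failures described in this report happen somewhere in here.
        for (int i = 1; i <= environments; i++) {
            plan.add("disableStaticNat env-" + i);
            plan.add("disassociateIpAddress env-" + i);
            plan.add("destroyVirtualMachine env-" + i);
            plan.add("deleteNetwork env-" + i);
        }
        plan.add("deleteVPC");
        return plan;
    }

    public static void main(String[] args) {
        build(20).forEach(System.out::println);
    }
}
```

With 20 environments the plan contains 162 calls, 80 of them in the teardown phase, each of which has to be applied on the same VPC virtual router.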

 

The hanging / timeout problems can occur at any time during environment deletion.  The first
few deletions can go through successfully, and then one fails at some point.  The failure can
be in any stage, i.e. disassociating the public IP, deleting the VM or deleting the guest network.
We looked at cloud.log, the agent log and the management server log but couldn’t find any obvious
errors.  It seems that the management server sends the request to do the deletion, but the VR
does not respond and the system/network becomes stuck in an invalid state. As a result, the
network often gets stuck in the “Shutdown” state.
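
A client can turn a hang like this into an explicit timeout by polling the async job with a deadline. The sketch below is not taken from the attached program. It assumes the status comes from the jobstatus field of queryAsyncJobResult (0 while pending, non-zero once finished) and leaves the actual HTTP call to the caller. The class name AsyncJobPoller and the TIMED_OUT value are hypothetical.

```java
import java.util.function.IntSupplier;

/** Hypothetical helper: polls a job status until it leaves "pending" or a deadline passes. */
final class AsyncJobPoller {
    static final int PENDING = 0;    // jobstatus while the job is still running
    static final int TIMED_OUT = -1; // local marker, not a CloudStack value

    static int await(IntSupplier jobStatus, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            int status = jobStatus.getAsInt();
            if (status != PENDING) {
                return status;       // finished: success or failure
            }
            if (System.nanoTime() - deadline >= 0) {
                return TIMED_OUT;    // still pending after the deadline
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return TIMED_OUT;
            }
        }
    }
}
```

A caller would pass a supplier that queries the job and treat TIMED_OUT as the point at which to stop the teardown and inspect the router, instead of issuing further deletions.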

 

Here are some errors in the management server log:

============================================
 2018-11-01 01:15:29,263 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4
job-29965) (logid:dbe80d4f) Complete async job-29965, jobStatus: FAILED, resultCode: 530,
result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed
to delete network"}

 2018-11-01 01:15:29,245 DEBUG [c.c.a.t.Request] (API-Job-Executor-119:ctx-c14b2ab4 job-29965
ctx-eb2dda94) (logid:dbe80d4f) Seq 4-667095694804259240: Received:  { Ans: , MgmtId: 7474664765770,
via: 4(cehv02.core.jazz.net), Ver: v1, Flags: 110, { GroupAnswer } }
 2018-11-01 01:15:29,245 WARN  [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4
job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Unable to destroy guest network on router VM*[DomainRouter|r-3388-VM]
 2018-11-01 01:15:29,247 WARN  [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4
job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Failed to destroy guest network config Ntwk*[1122|Guest|12]
on router VM[DomainRouter|r-3388-VM]
 2018-11-01 01:15:29,247 WARN  [c.c.n.e.VpcVirtualRouterElement] (API-Job-Executor-119:ctx-c14b2ab4
job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Failed to unplug nic in network Ntwk*[1122|Guest|12]
for virtual router VM[DomainRouter|r-3388-VM]
 2018-11-01 01:15:29,247 WARN  [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4
job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Unable to complete shutdown of the network elements
due to element: VpcVirtualRouter*
 2018-11-01 01:15:29,255 DEBUG [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4
job-29965 ctx-eb2dda94) (logid:dbe80d4f) Lock is released for network Ntwk[1122|Guest|12]
as a part of network shutdown
 2018-11-01 01:15:29,256 DEBUG [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4
job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Network is not not in the correct state to be destroyed:
Shutdown*

============================================
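
To pull only the warnings for one job out of a large management server log, a small filter like the following may help. It is a sketch that assumes each log entry sits on a single line (the excerpt above is wrapped by the mail client and carries JIRA bold markers) and has the shape "timestamp LEVEL [logger] (thread context) (logid:...) message". The class name JobWarnings is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: keeps WARN entries whose thread context mentions a given job id. */
final class JobWarnings {
    // Groups: 1 timestamp, 2 level, 3 logger, 4 thread context, 5 logid, 6 message.
    private static final Pattern LINE = Pattern.compile(
        "^\\s*(\\S+ \\S+)\\s+(\\w+)\\s+\\[([^\\]]+)\\]\\s+\\(([^)]*)\\)\\s+\\(logid:(\\w+)\\)\\s+(.*)$");

    static List<String> forJob(List<String> lines, String jobId) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                continue; // not a log entry in the assumed shape
            }
            boolean isWarn = "WARN".equals(m.group(2));
            boolean sameJob = Arrays.asList(m.group(4).split("\\s+")).contains(jobId);
            if (isWarn && sameJob) {
                out.add(m.group(3) + ": " + m.group(6));
            }
        }
        return out;
    }
}
```

Calling JobWarnings.forJob(lines, "job-29965") on the unwrapped entries above would return the four WARN messages and skip the DEBUG ones.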

 

I'm attaching a simple Java program that performs all of the steps described above and
allowed us to consistently run into the bug.

 

To use the application:

 

java -jar testCloudStack.jar <CloudStack API url: e.g. http://foo:8080/client/api> <apiKey>
<secretKey> <zoneName>
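
The apiKey and secretKey arguments are what a client needs to sign CloudStack API requests. How the attached jar does this is not shown, so the following is a sketch of the signing scheme as I understand it from the CloudStack developer documentation: sort the parameters by name, URL-encode the values, lower-case the whole string, compute an HMAC-SHA1 with the secret key, then Base64- and URL-encode the result. The class name CloudStackSigner is hypothetical and the code has not been checked against a live server.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper: builds a signed query string for a CloudStack API call. */
final class CloudStackSigner {

    // URL-encode a value, writing spaces as %20 instead of '+'.
    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8).replace("+", "%20");
    }

    static String signedQuery(Map<String, String> params, String apiKey, String secretKey) {
        // Sort parameters by name, ignoring case.
        Map<String, String> sorted = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        sorted.putAll(params);
        sorted.put("apiKey", apiKey);
        String query = sorted.entrySet().stream()
            .map(e -> e.getKey() + "=" + enc(e.getValue()))
            .collect(Collectors.joining("&"));
        try {
            // The signature is computed over the lower-cased query string.
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA1"));
            byte[] digest = mac.doFinal(
                query.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));
            return query + "&signature=" + enc(Base64.getEncoder().encodeToString(digest));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA1 is not available", e);
        }
    }
}
```

The returned string would be appended to the API URL given on the command line, for example http://foo:8080/client/api?<signed query>.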

 

Note that the test application works successfully with CloudStack server 4.9.2 but consistently
reproduces the bug with CloudStack server 4.11.1.

  was:
We are using CloudStack 4.11.1 running with KVM hosts.  To simulate our usecase, we created
a small program that calls CloudStack API to

1) create VPC network with 20 guest networks, each containing one virtual machine with a public
IP address allocated.  

2) delete the machines and networks one by one. 

 

However,  we frequently get a timeout error, sometimes during VM deletion, and sometimes
during guest network deletion or even during static NAT disable step.  Once the timeout occurs,
it seems that the VPC network / Virtual router is in an *unstable/corrupted* state.  We need
to restart the Virtual Router with a clean option (sometimes have to try restart several times
as it fails to deploy router VM as well).  After that, we can continue delete the network
remaining environment.  Here is the high level steps that we did:
 # Create VPC Network
 # For each of the 20 "environments"
 ## Create Guest Network
 ## Add a VM to the network
 ## Acquire Public IP
 ## Associate the Public IP with VM
 # For each of the 20 environment
 ## Disassociate the Public IP
 ## Delete VM
 ## Delete Guest network
 # Delete VPC

 

The hanging / timeout problems could be in any time during environment deletion.  The first
few deletion could go through successfully, and then fail at some point.  The failure could
be in any stage.  i.e. Disassociate public IP, delete VM or delete guest network.  We look
at cloud.log, agent log and management server log but couldn’t get any obvious errors. 
It may seems that management server sends the request to do the deletion, but the VR does
not respond and the system/network becomes stuck in an invalid state. Network is often gets
stuck in “Shutdown” state as a result

 

Here are some error in the management server log:

============================================
2018-11-01 01:15:29,263 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4
job-29965) (logid:dbe80d4f) Complete async job-29965, jobStatus: FAILED, resultCode: 530,
result: org.apache.cloudstack.api.response.ExceptionResponse/null/\{"uuidList":[],"errorcode":530,"errortext":"Failed
to delete network"}

2018-11-01 01:15:29,245 DEBUG [c.c.a.t.Request] (API-Job-Executor-119:ctx-c14b2ab4 job-29965
ctx-eb2dda94) (logid:dbe80d4f) Seq 4-667095694804259240: Received:  { Ans: , MgmtId: [7474664765770|tel:7474664765770],
via: 4([cehv02.core.jazz.net|http://cehv02.core.jazz.net/]), Ver: v1, Flags: 110, \{ GroupAnswer
} }
2018-11-01 01:15:29,245 WARN  [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4
job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Unable to destroy guest network on router VM*[DomainRouter|r-3388-VM]
2018-11-01 01:15:29,247 WARN  [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4
job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Failed to destroy guest network config Ntwk*[1122|Guest|12]
on router VM[DomainRouter|r-3388-VM]
2018-11-01 01:15:29,247 WARN  [c.c.n.e.VpcVirtualRouterElement] (API-Job-Executor-119:ctx-c14b2ab4
job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Failed to unplug nic in network Ntwk*[1122|Guest|12]
for virtual router VM[DomainRouter|r-3388-VM]
2018-11-01 01:15:29,247 WARN  [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4
job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Unable to complete shutdown of the network elements
due to element: VpcVirtualRouter*
2018-11-01 01:15:29,255 DEBUG [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4
job-29965 ctx-eb2dda94) (logid:dbe80d4f) Lock is released for network Ntwk[1122|Guest|12]
as a part of network shutdown
2018-11-01 01:15:29,256 DEBUG [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4
job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Network is not not in the correct state to be destroyed:
Shutdown*

============================================

 

I'm attaching the simple java program which performs all of the above described steps and
which allowed us to consistently run into the bug.

 

To use the application:

 

java -jar testCloudStack.jar <CloudStack API url: e.g. http://foo:8080/client/api> <apiKey>
<secretKey> <zoneName>

 

Note, that the test application works successfully with CloudStack server 4.9.2 but consistently
reproduces the bug with CloudStack server 4.11.1


> VPC Router Corruption when working with large number of networks containing instances
with public IP addresses 
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-10400
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10400
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: API
>    Affects Versions: 4.11.1.0
>            Reporter: Barys Dubauski
>            Priority: Critical
>         Attachments: testCloudStack.jar
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
