mxnet-dev mailing list archives

From Chaitanya Bapat <chai.ba...@gmail.com>
Subject CI restart notification
Date Thu, 17 Oct 2019 00:00:56 GMT
Hello community,

Since yesterday, Oct 15, 2019, the CI/CD system has been facing an issue
related to auto-scaling. Lai and Pedro's efforts helped find the root
cause (a closed port on the Jenkins slave). The issue was temporarily
resolved by removing the autoscaled instances and restarting the Jenkins
master. As a result, the CI runs for open PRs need to be restarted.

We still need to do a post-mortem on this issue, but some action items
that need to be addressed are:

   - Monitor the number of slaves that failed to connect, and add an
   alarm that fires when that count is > 0
   - Fix the lambda error with a pending lifecycle ("starting" is not a
   valid state)
   - Deploy new lambda:
   https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-lifecycle.html
   - Fix throttling problems with EC2.
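
As a rough illustration of the first two items, here is a minimal Python
sketch. It is not the actual lambda or monitoring code; the function names
are hypothetical, and it assumes the lambda error came from comparing
against a state name ("starting") that EC2 does not report. The state
names are the ones listed on the EC2 instance lifecycle page linked above.

```python
# Hypothetical sketch, not the real CI lambda or alarm configuration.

# Instance state names from the EC2 instance lifecycle documentation.
EC2_INSTANCE_STATES = frozenset(
    {"pending", "running", "stopping", "stopped", "shutting-down", "terminated"}
)


def validate_state(state: str) -> str:
    """Return the state unchanged if EC2 can report it, else raise ValueError."""
    if state not in EC2_INSTANCE_STATES:
        raise ValueError(f"{state!r} is not an EC2 instance state")
    return state


def should_alarm(failed_to_connect: int, threshold: int = 0) -> bool:
    """Alarm when the count of slaves that failed to connect exceeds the threshold."""
    return failed_to_connect > threshold
```

With the default threshold of 0, a single slave that fails to connect is
enough to raise the alarm, and a check against "starting" fails loudly
instead of silently never matching.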

We are actively working on retriggering CI for the community's PRs.
Apologies for the inconvenience caused.

Thank you,
Chai

-- 
*Chaitanya Prakash Bapat*
*+1 (973) 953-6299*