mxnet-dev mailing list archives

From Chaitanya Bapat
Subject CI restart notification
Date Thu, 17 Oct 2019 00:00:56 GMT
Hello community,

Since yesterday, Oct 15, 2019, the CI/CD system has been facing an issue
related to auto-scaling. Lai and Pedro's efforts helped find the root
cause (a closed port on a Jenkins slave). The issue was temporarily
resolved by removing the autoscaled instances and restarting the Jenkins
master. As a result, the CI runs for the affected PRs need to be restarted.

We still need to do a post-mortem on this issue, but the action items
identified so far are:

   - Monitor the number of slaves that fail to connect, and add an alarm
   that fires when that count exceeds 0
   - Fix lambda error with a pending lifecycle (starting is not valid)
   - Deploy new lambda:
   - Fix throttling problems with EC2.
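
The first item above (alarm when the failed-to-connect count exceeds 0)
could be sketched roughly as follows. This is only an illustration of the
threshold check, not the actual CI monitoring code: the AgentStatus record,
the function names, and the way connection state is obtained are all
assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class AgentStatus:
    """Hypothetical record for one Jenkins slave; not a Jenkins API type."""
    name: str
    connected: bool


def failed_to_connect(agents: Iterable[AgentStatus]) -> List[str]:
    """Return the names of the slaves that are not connected."""
    return [a.name for a in agents if not a.connected]


def should_alarm(agents: Iterable[AgentStatus], threshold: int = 0) -> bool:
    """Alarm when the failed-to-connect count is strictly above the threshold."""
    return len(failed_to_connect(agents)) > threshold


if __name__ == "__main__":
    # Made-up fleet used only to exercise the check.
    fleet = [
        AgentStatus("slave-1", connected=True),
        AgentStatus("slave-2", connected=False),
    ]
    print(failed_to_connect(fleet))
    print(should_alarm(fleet))
```

With the default threshold of 0, a single slave that fails to connect is
enough to raise the alarm, which matches the "threshold > 0" wording above.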

We are actively working on retriggering the PRs for the community.
Apologies for the inconvenience caused.

Thank you,

*Chaitanya Prakash Bapat*
*+1 (973) 953-6299*

