mxnet-dev mailing list archives

From Chaitanya Bapat <chai.ba...@gmail.com>
Subject Re: Update : CI windows-gpu Failure
Date Thu, 02 Apr 2020 03:38:05 GMT
Hello MXNet Community,

For a week now, CI has been blocked by the windows-gpu failure.
The PR to fix it is still WIP:
https://github.com/apache/incubator-mxnet/pull/17808

This updates the toolchain from 32-bit to 64-bit [to resolve the 2GB memory
linker error CI is currently hitting], along with a host of other
long-overdue updates [VS2019, opencv, cudnn, etc.].
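
For context, here is a minimal sketch of how a 64-bit host toolset is
commonly selected when a build is configured through CMake's Visual Studio
generator. It assumes CMake 3.13 or newer and the "Visual Studio 16 2019"
generator. The helper name and the example paths are hypothetical and are
not taken from the PR.

```python
# Hypothetical helper: assembles a CMake configure command that asks the
# Visual Studio generator to use the 64-bit (x64-hosted) compiler and linker.
# A 32-bit-hosted linker is limited by the 32-bit process address space,
# which is where out-of-memory errors on large link steps come from.

def cmake_configure_args(source_dir, build_dir,
                         generator="Visual Studio 16 2019",
                         use_64bit_host_tools=True):
    """Return the argument list for a CMake configure step."""
    args = [
        "cmake",
        "-S", source_dir,   # source tree
        "-B", build_dir,    # build tree
        "-G", generator,    # Visual Studio generator
        "-A", "x64",        # target platform: 64-bit binaries
    ]
    if use_64bit_host_tools:
        # Toolset option: run the 64-bit-hosted tools instead of the
        # 32-bit-hosted ones.
        args += ["-T", "host=x64"]
    return args


if __name__ == "__main__":
    # Example paths are placeholders.
    print(" ".join(cmake_configure_args(".", "build")))
```

The resulting list can be passed to subprocess.run on a Windows worker.
Whether the CI scripts set the toolset this way is an assumption here; the
PR is the authoritative reference.
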
We have 2 pending issues:
1. cuda segfault in Py3 Windows GPU test
OSError: exception: access violation writing 0x0000000000000000

2. Jenkins Channel Connection
"hudson.remoting.ChannelClosedException: Channel
"hudson.remoting.Channel@5cca06e6:JNLP4-connect connection from [...]
failed. The channel is closing down or has closed down"

We are hard at work to unblock the CI & get the PR fix merged.

Since we want to focus on fixing the windows-gpu issue and avoid
complicating the situation further, we are not disabling the windows-gpu
build as of now. As a backup plan, we will disable the windows-gpu builds
by 4/5 Sunday EOD if things don’t recover by then.

Thanks for the continued patience.
Chai,
on behalf of the MXNet CI team



On Thu, 26 Mar 2020 at 21:16, Chaitanya Bapat <chai.bapat@gmail.com> wrote:

> Hello MXNet community,
>
> It’s been over 3 days now that windows-gpu builds are failing on CI.
> The team (me, Leo, Ningyuan, Joe, Pedro) are at work trying to identify
> the root-cause and fix.
>
> Issue: The linker runs out of memory because the 32-bit toolchain cannot
> address all of the machine's available memory.
>
> Multiple attempts have been made (albeit with limited success):
> 1. Reduced the number of builds per worker (for windows-cpu node) from 3 to 1
> 2. Updated the toolchain from 32-bit to 64-bit (as pointed out by multiple
> people)
> PR : https://github.com/apache/incubator-mxnet/pull/17916
> [related to Leo’s PR :
> https://github.com/apache/incubator-mxnet/pull/17912]
>
> Road to unblock:
> An updated AMI, coupled with the updated toolchain, may help.
> Ningyuan has an updated AMI for windows (PR :
> https://github.com/apache/incubator-mxnet/pull/17808) - vs2019, cuda10.2,
> cmake fixes etc.
>
> We will get it deployed by tomorrow and update the status accordingly.
>
> Thanks for the patience. Apologies for the inconvenience caused.
> Thank you 🙏
> Chai,
> on behalf of the MXNet CI team
>
> --
> *Chaitanya Prakash Bapat*
> *+1 (973) 953-6299*
>


-- 
*Chaitanya Prakash Bapat*
*+1 (973) 953-6299*

