mxnet-dev mailing list archives

From sandeep krishnamurthy <sandeep.krishn...@gmail.com>
Subject Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Date Thu, 27 Jun 2019 22:36:01 GMT
Hello Ciyong/Pedro,

I ran operator benchmarks on 1.4.1 and 1.5.0.rc2. (The results are
incomplete: they do not cover all MXNet operators, are not presented in
the best possible way, and are still WIP.)
https://gist.github.com/sandeep-krishnamurthy/e0a2be893c8c4d484390c9c8813bdf50

The following operators look slower in 1.5 compared to 1.4.1:
- BatchNorm
- Pooling
- FullyConnected
- batch_dot
- Dot
- broadcast_mul
- log_softmax
and a few other operators.

Also, several operators run a lot faster on 1.5 compared to 1.4.1, for
example Convolution, flatten, and the elementwise operators. So it seems
likely that a few operators have regressed noticeably; however, because
other operators have improved, the end-to-end effect is not that
significant, which hides a lot of the regression. We need a more detailed
analysis of per-operator performance. We will not be able to do this for
the current release, but we should have a more concrete way of detecting
such performance regressions before the next release.
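
One way to make the per-operator comparison more concrete is to compare the
two result sets mechanically. The sketch below is only an illustration:
`find_regressions`, the 5% threshold, and the sample timings are hypothetical
and are not taken from the gist or from the opperf utility.

```python
# Hypothetical per-operator mean latencies in milliseconds. These numbers are
# made up for illustration and are NOT measurements from this thread.
baseline_ms = {"BatchNorm": 1.00, "Pooling": 2.00, "Convolution": 4.00}
candidate_ms = {"BatchNorm": 1.37, "Pooling": 2.02, "Convolution": 3.00}


def find_regressions(baseline, candidate, threshold=0.05):
    """Return {op: relative_change} for ops that got slower by more than
    `threshold` (0.05 means 5%) in `candidate` relative to `baseline`."""
    regressions = {}
    for op, base in baseline.items():
        # Skip operators missing from the candidate run or with no usable
        # baseline timing.
        if op not in candidate or base <= 0:
            continue
        change = (candidate[op] - base) / base
        if change > threshold:
            regressions[op] = change
    return regressions


if __name__ == "__main__":
    for op, change in sorted(find_regressions(baseline_ms, candidate_ms).items()):
        print(f"{op}: {change:+.1%} relative to baseline")
```

Operators that got faster (Convolution in the made-up data) are ignored, so
an overall speedup cannot mask an individual slowdown.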

Setup:
1.5 => Build from source (head of 1.5.rc2 tag), built with MKLDNN
1.4.1 => PyPi mxnet-mkl==1.4.1
Machine: C5.18X
No explicit environment variables were set
Operator benchmark code -
https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf
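
Since no environment variables were set for these runs, and the end-to-end
timings quoted below vary from run to run, a small harness that pins the
environment and repeats the measurement may help. This is a minimal sketch:
`pinned_env` and `time_command` are hypothetical helper names, the two
variables are the ones Ciyong mentions in the quoted message, and the command
being timed is a trivial placeholder rather than a real benchmark.

```python
import os
import statistics
import subprocess
import sys
import time


def pinned_env(num_threads=18):
    """Copy the current environment and add the thread-binding settings."""
    env = dict(os.environ)
    env["KMP_AFFINITY"] = "granularity=fine,noduplicates,compact,1,0"
    env["OMP_NUM_THREADS"] = str(num_threads)
    return env


def time_command(cmd, runs=3, env=None):
    """Run `cmd` `runs` times and return the wall-clock seconds of each run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, env=env)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Placeholder command; replace it with the benchmark script to measure.
    cmd = [sys.executable, "-c", "pass"]
    samples = time_command(cmd, runs=3, env=pinned_env())
    print(f"mean={statistics.mean(samples):.3f}s "
          f"stdev={statistics.stdev(samples):.3f}s over {len(samples)} runs")
```

Reporting the spread next to the mean makes it easier to judge whether a gap
of a few percent between two builds is larger than the run-to-run noise.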

Best,
Sandeep


On Thu, Jun 27, 2019 at 10:42 AM Pedro Larroy <pedro.larroy.lists@gmail.com>
wrote:

> I will try to run a few benchmarks in a bare metal instance tonight to
> remove virtualization variance for the measurements and provide some
> numbers.
>
> Please propose a set of models / examples that would be desirable to
> run before the release and provide a link to an easy to run script
> with instructions so we can validate the release better.
>
> Thank you.
>
> On Thu, Jun 27, 2019 at 10:01 AM Lai Wei <royweilai@gmail.com> wrote:
> >
> > Dear @dev,
> >
> > I'm cancelling the vote for the cached op fix:
> >
> > https://github.com/apache/incubator-mxnet/pull/15298
> >
> > As for the possible CPU training regression, it does not look like a
> > blocker for now.
> >
> > I will start a new rc2 vote, please help to validate.
> >
> > Thanks!
> >
> >
> > On Thu, Jun 27, 2019 at 10:06 PM Chen, Ciyong <ciyong.chen@intel.com>
> > wrote:
> >
> > > Hi Pedro,
> > >
> > > I was able to reproduce a similar result (v1.5 is ~5.6% slower than
> > > v1.4; I was using 18 cores for computing) with your script on
> > > C5.18xlarge. However, I needed to bind the cores with the commands
> > > below when running the script (without setting the env variables,
> > > v1.5 and v1.4 were within 1% of each other):
> > >         export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
> > >         export OMP_NUM_THREADS=18
> > >
> > > Did you set any env variables when running the script?
> > >
> > > The performance results I got are below:
> > > 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> > > real    12m10.856s
> > > user    234m49.576s
> > > sys     4m38.044s
> > >
> > > 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> > > real    12m52.140s
> > > user    246m30.740s
> > > sys     5m8.188s
> > >
> > > Looking at the profiling data, most of the ops have the same perf in
> > > v1.4 and v1.5, but some ops such as "_backward_BatchNorm" and
> > > "Pooling" are ~1.37x slower on v1.5 compared with v1.4.
> > > I will do further analysis on these ops.
> > >
> > > Here's the hardware/OS info from my side:
> > > ----------Python Info----------
> > > Version      : 3.6.8
> > > Compiler     : GCC 7.3.0
> > > Build        : ('default', 'Dec 30 2018 01:22:34')
> > > Arch         : ('64bit', '')
> > > ------------Pip Info-----------
> > > Version      : 19.0.3
> > > Directory    :
> > > /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> > > ----------MXNet Info-----------
> > > Version      : 1.5.0
> > > Directory    : /home/ubuntu/ws/incubator-mxnet/python/mxnet
> > > Hashtag not found. Not installed from pre-built package.
> > > ----------System Info----------
> > > Platform     : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> > > system       : Linux
> > > node         : ip-172-31-32-129
> > > release      : 4.4.0-1085-aws
> > > version      : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
> > > ----------Hardware Info----------
> > > machine      : x86_64
> > > processor    : x86_64
> > > Architecture:          x86_64
> > > CPU op-mode(s):        32-bit, 64-bit
> > > Byte Order:            Little Endian
> > > CPU(s):                72
> > > On-line CPU(s) list:   0-71
> > > Thread(s) per core:    2
> > > Core(s) per socket:    18
> > > Socket(s):             2
> > > NUMA node(s):          2
> > > Vendor ID:             GenuineIntel
> > > CPU family:            6
> > > Model:                 85
> > > Model name:            Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
> > > Stepping:              3
> > > CPU MHz:               3000.000
> > > BogoMIPS:              6000.00
> > > Hypervisor vendor:     KVM
> > > Virtualization type:   full
> > > L1d cache:             32K
> > > L1i cache:             32K
> > > L2 cache:              1024K
> > > L3 cache:              25344K
> > > NUMA node0 CPU(s):     0-17,36-53
> > > NUMA node1 CPU(s):     18-35,54-71
> > > Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> > > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb
> > > rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc
> > > aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1
> > > sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
> > > hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase
> > > tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx
> > > smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
> > > ----------Network Test----------
> > >
> > >
> > > -Ciyong
> > >
> > >
> > > -----Original Message-----
> > > From: Zhao, Patric [mailto:patric.zhao@intel.com]
> > > Sent: Thursday, June 27, 2019 9:55 AM
> > > To: dev@mxnet.incubator.apache.org
> > > Cc: dev@mxnet.apache.org
> > > Subject: RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> > >
> > > Could we run more epochs to see the performance difference, or profile
> > > the difference between a good and a bad run?
> > >
> > > > -----Original Message-----
> > > > From: Pedro Larroy [mailto:pedro.larroy.lists@gmail.com]
> > > > Sent: Thursday, June 27, 2019 9:35 AM
> > > > To: dev@mxnet.incubator.apache.org
> > > > Cc: dev@mxnet.apache.org
> > > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version
> > > > 1.5.0.rc1
> > > >
> > > > I ran it again and the gap is bigger again; I guess we need to
> > > > average out the times across several runs:
> > > >
> > > > piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
> > > > (master)+$ time ~/mxnet_1.4/py3_venv/bin/python cifar10.py --epochs 5
> > > > && time ~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
> > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:172:
> > > > ImageRecordIOParser2:
> > > > /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4
> threads
> > > > for decoding..
> > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:172:
> > > > ImageRecordIOParser2:
> > > > /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads
> > > > for decoding..
> > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> > > > lr_schedule: {0: 0.05, 82: 0.005000000000000001, 123: 0.0005, 300: 0.0001}
> > > > Epoch 0, Changed learning rate to 0.05
> > > > [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> > > > 147456 bytes with malloc directly
> > > > [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> > > > 589824 bytes with malloc directly
> > > > [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> > > > 2359296 bytes with malloc directly
> > > > [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> > > > 9437184 bytes with malloc directly
> > > > Epoch 0, Batch 199, Speed=384.149839
> > > > Epoch 0, Duration=140.919567
> > > > Epoch 0, Training accuracy=0.115169
> > > > Epoch 0, Validation accuracy=0.141317
> > > > Epoch 1, Batch 199, Speed=433.380512
> > > > Epoch 1, Duration=119.553233
> > > > Epoch 1, Training accuracy=0.170956
> > > > Epoch 1, Validation accuracy=0.216146
> > > > Epoch 2, Batch 199, Speed=434.864699
> > > > Epoch 2, Duration=123.278490
> > > > Epoch 2, Training accuracy=0.209455
> > > > Epoch 2, Validation accuracy=0.247296
> > > > Epoch 3, Batch 199, Speed=433.401854
> > > > Epoch 3, Duration=118.327797
> > > > Epoch 3, Training accuracy=0.248701
> > > > Epoch 3, Validation accuracy=0.302083
> > > > Epoch 4, Batch 199, Speed=419.713707
> > > > Epoch 4, Duration=126.468409
> > > > Epoch 4, Training accuracy=0.260949
> > > > Epoch 4, Validation accuracy=0.269030
> > > >
> > > > real    10m55.796s
> > > > user    399m33.567s
> > > > sys     13m55.904s
> > > > [23:28:04] ../src/io/iter_image_recordio_2.cc:172:
> > > > ImageRecordIOParser2:
> > > > /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4
> threads
> > > > for decoding..
> > > > [23:28:04] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > > > [23:28:04] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> > > > [23:28:04] ../src/io/iter_image_recordio_2.cc:172:
> > > > ImageRecordIOParser2:
> > > > /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads
> > > > for decoding..
> > > > [23:28:04] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > > > [23:28:04] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> > > > lr_schedule: {0: 0.05, 82: 0.005000000000000001, 123: 0.0005, 300: 0.0001}
> > > > Epoch 0, Changed learning rate to 0.05
> > > > Epoch 0, Batch 199, Speed=419.039188
> > > > Epoch 0, Duration=143.934903
> > > > Epoch 0, Training accuracy=0.122542
> > > > Epoch 0, Validation accuracy=0.164359
> > > > Epoch 1, Batch 199, Speed=445.257048
> > > > Epoch 1, Duration=135.248399
> > > > Epoch 1, Training accuracy=0.178828
> > > > Epoch 1, Validation accuracy=0.199419
> > > > Epoch 2, Batch 199, Speed=447.115215
> > > > Epoch 2, Duration=132.003770
> > > > Epoch 2, Training accuracy=0.217808
> > > > Epoch 2, Validation accuracy=0.233073
> > > > Epoch 3, Batch 199, Speed=441.079477
> > > > Epoch 3, Duration=126.543316
> > > > Epoch 3, Training accuracy=0.248102
> > > > Epoch 3, Validation accuracy=0.293870
> > > > Epoch 4, Batch 199, Speed=449.329787
> > > > Epoch 4, Duration=138.398325
> > > > Epoch 4, Training accuracy=0.270021
> > > > Epoch 4, Validation accuracy=0.311498
> > > >
> > > > real    11m45.329s
> > > > user    426m13.908s
> > > > sys     16m45.093s
> > > >
> > > > On Wed, Jun 26, 2019 at 4:18 PM Pedro Larroy
> > > > <pedro.larroy.lists@gmail.com> wrote:
> > > > >
> > > > > The difference looks smaller now, more like your numbers. I wonder
> > > > > if something happened during the previous benchmark like a system
> > > > > update...
> > > > >
> > > > >
> > > > > piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench (master)+$
> > > > > time ~/mxnet_1.4/py3_venv/bin/python cifar10.py --epochs 5 && time
> > > > > ~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
> > > > > [22:49:41] ../src/io/iter_image_recordio_2.cc:172:
> > > > > ImageRecordIOParser2:
> > > > > /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4
> > > > > threads for decoding..
> > > > > [22:49:41] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > > > > [22:49:41] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > > > completed
> > > > > [22:49:41] ../src/io/iter_image_recordio_2.cc:172:
> > > > > ImageRecordIOParser2:
> > > > > /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4
> > > > > threads for decoding..
> > > > > [22:49:41] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > > > > [22:49:41] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > > > completed
> > > > > lr_schedule: {0: 0.05, 82: 0.005000000000000001, 123: 0.0005, 300: 0.0001}
> > > > > Epoch 0, Changed learning rate to 0.05
> > > > > [22:49:42] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> > > > > 147456 bytes with malloc directly
> > > > > [22:49:42] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> > > > > 589824 bytes with malloc directly
> > > > > [22:49:42] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> > > > > 2359296 bytes with malloc directly
> > > > > [22:49:42] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> > > > > 9437184 bytes with malloc directly
> > > > > Epoch 0, Batch 199, Speed=426.182733
> > > > > Epoch 0, Duration=134.868458
> > > > > Epoch 0, Training accuracy=0.127238
> > > > > Epoch 0, Validation accuracy=0.206388
> > > > > Epoch 1, Batch 199, Speed=313.127156
> > > > > Epoch 1, Duration=128.041775
> > > > > Epoch 1, Training accuracy=0.182065
> > > > > Epoch 1, Validation accuracy=0.202524
> > > > > Epoch 2, Batch 199, Speed=410.931187
> > > > > Epoch 2, Duration=124.920588
> > > > > Epoch 2, Training accuracy=0.202584
> > > > > Epoch 2, Validation accuracy=0.245693
> > > > > Epoch 3, Batch 199, Speed=419.119335
> > > > > Epoch 3, Duration=120.948349
> > > > > Epoch 3, Training accuracy=0.235854
> > > > > Epoch 3, Validation accuracy=0.291066
> > > > > Epoch 4, Batch 199, Speed=430.473733
> > > > > Epoch 4, Duration=130.181724
> > > > > Epoch 4, Training accuracy=0.257773
> > > > > Epoch 4, Validation accuracy=0.304988
> > > > >
> > > > > real    11m7.356s
> > > > > user    406m9.910s
> > > > > sys     14m18.349s
> > > > > [23:00:49] ../src/io/iter_image_recordio_2.cc:172:
> > > > > ImageRecordIOParser2:
> > > > > /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4
> > > > > threads for decoding..
> > > > > [23:00:49] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > > > > [23:00:49] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > > > completed
> > > > > [23:00:49] ../src/io/iter_image_recordio_2.cc:172:
> > > > > ImageRecordIOParser2:
> > > > > /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4
> > > > > threads for decoding..
> > > > > [23:00:49] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > > > > [23:00:49] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > > > completed
> > > > > lr_schedule: {0: 0.05, 82: 0.005000000000000001, 123: 0.0005, 300: 0.0001}
> > > > > Epoch 0, Changed learning rate to 0.05
> > > > > Epoch 0, Batch 199, Speed=348.618154
> > > > > Epoch 0, Duration=146.469352
> > > > > Epoch 0, Training accuracy=0.124121
> > > > > Epoch 0, Validation accuracy=0.167227
> > > > > Epoch 1, Batch 199, Speed=452.790825
> > > > > Epoch 1, Duration=130.199421
> > > > > Epoch 1, Training accuracy=0.183863
> > > > > Epoch 1, Validation accuracy=0.237079
> > > > > Epoch 2, Batch 199, Speed=451.406559
> > > > > Epoch 2, Duration=126.320823
> > > > > Epoch 2, Training accuracy=0.214844
> > > > > Epoch 2, Validation accuracy=0.244692
> > > > > Epoch 3, Batch 199, Speed=403.161873
> > > > > Epoch 3, Duration=125.331660
> > > > > Epoch 3, Training accuracy=0.243506
> > > > > Epoch 3, Validation accuracy=0.301182
> > > > > Epoch 4, Batch 199, Speed=450.826598
> > > > > Epoch 4, Duration=126.426253
> > > > > Epoch 4, Training accuracy=0.266424
> > > > > Epoch 4, Validation accuracy=0.311899
> > > > >
> > > > > real    11m21.930s
> > > > > user    415m3.855s
> > > > > sys     13m53.975s
> > > > >
> > > > > On Wed, Jun 26, 2019 at 3:50 PM Pedro Larroy
> > > > > <pedro.larroy.lists@gmail.com> wrote:
> > > > > >
> > > > > > Hi Ciyong, thanks for trying to reproduce:
> > > > > >
> > > > > > I used this one:
> > > > > > https://github.com/awslabs/deeplearning-benchmark/blob/master/dawnbench/cifar10.py
> > > > > >
> > > > > > Could you provide hardware and OS details?
> > > > > >
> > > > > > I will rerun and repost numbers in a few minutes.
> > > > > >
> > > > > > Pedro.
> > > > > >
> > > > > > On Wed, Jun 26, 2019 at 4:18 AM Chen, Ciyong
> > > > > > <ciyong.chen@intel.com>
> > > > wrote:
> > > > > > >
> > > > > > > Hi Pedro,
> > > > > > >
> > > > > > > I'm looking at this case, using the script
> > > > > > > "incubator-mxnet/example/image-classification/train_cifar10.py"
> > > > > > > to get the timing data, but there does not seem to be much
> > > > > > > difference between mxnet 1.4.1.rc0 and 1.5.0.rc1 on C5.18xlarge.
> > > > > > >
> > > > > > > I am not sure whether the Python scripts differ; can you point
> > > > > > > me to the link for your script (cifar10.py)?
> > > > > > > Or you could also try MXNet's script (train_cifar10.py) and see
> > > > > > > how it performs.
> > > > > > >
> > > > > > > Here's the command I used to collect the time:
> > > > > > >         python train_cifar10.py --num-epoch=5
> > > > > > >
> > > > > > > 1) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> > > > > > >         real    9m4.880s
> > > > > > >         user    333m13.340s
> > > > > > >         sys     14m36.100s
> > > > > > >
> > > > > > > 2) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> > > > > > >         real    9m2.155s
> > > > > > >         user    329m37.092s
> > > > > > >         sys     16m8.668s
> > > > > > >
> > > > > > > -Ciyong
> > > > > > >
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Pedro Larroy [mailto:pedro.larroy.lists@gmail.com]
> > > > > > > Sent: Wednesday, June 26, 2019 6:28 AM
> > > > > > > To: dev@mxnet.incubator.apache.org
> > > > > > > Cc: dev@mxnet.apache.org
> > > > > > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version
> > > > > > > 1.5.0.rc1
> > > > > > >
> > > > > > > Hi these were my build flags and system info:
> > > > > > >
> > > > > > >
> > > > > > > --- # CMake configuration
> > > > > > > USE_CUDA: "OFF" # Build with CUDA support
> > > > > > > USE_OLDCMAKECUDA: "OFF" # Build with old cmake cuda
> > > > > > > USE_NCCL: "OFF" # Use NVidia NCCL with CUDA
> > > > > > > USE_OPENCV: "ON" # Build with OpenCV support
> > > > > > > USE_OPENMP: "ON" # Build with Openmp support
> > > > > > > USE_CUDNN: "ON" # Build with cudnn support) # one could set CUDNN_ROOT for search path
> > > > > > > USE_SSE: "ON" # Build with x86 SSE instruction support IF NOT ARM
> > > > > > > USE_F16C: "ON" # Build with x86 F16C instruction support) # autodetects support if "ON"
> > > > > > > USE_LAPACK: "ON" # Build with lapack support
> > > > > > > USE_MKL_IF_AVAILABLE: "ON" # Use MKL if found
> > > > > > > USE_MKLML_MKL: "ON" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > > > > USE_MKLDNN: "ON" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > > > > USE_OPERATOR_TUNING: "ON" # Enable auto-tuning of operators IF NOT MSVC
> > > > > > > USE_GPERFTOOLS: "ON" # Build with GPerfTools support (if found)
> > > > > > > USE_JEMALLOC: "ON" # Build with Jemalloc support
> > > > > > > USE_PROFILER: "ON" # Build with Profiler support
> > > > > > > USE_DIST_KVSTORE: "OFF" # Build with DIST_KVSTORE support
> > > > > > > USE_PLUGINS_WARPCTC: "OFF" # Use WARPCTC Plugins
> > > > > > > USE_PLUGIN_CAFFE: "OFF" # Use Caffe Plugin
> > > > > > > USE_CPP_PACKAGE: "OFF" # Build C++ Package
> > > > > > > USE_MXNET_LIB_NAMING: "ON" # Use MXNet library naming conventions.
> > > > > > > USE_GPROF: "OFF" # Compile with gprof (profiling) flag
> > > > > > > USE_CXX14_IF_AVAILABLE: "OFF" # Build with C++14 if the compiler supports it
> > > > > > > USE_VTUNE: "OFF" # Enable use of Intel Amplifier XE (VTune)) # one could set VTUNE_ROOT for search path
> > > > > > > ENABLE_CUDA_RTC: "ON" # Build with CUDA runtime compilation support
> > > > > > > BUILD_CPP_EXAMPLES: "ON" # Build cpp examples
> > > > > > > INSTALL_EXAMPLES: "OFF" # Install the example source files.
> > > > > > > USE_SIGNAL_HANDLER: "ON" # Print stack traces on segfaults.
> > > > > > > USE_TENSORRT: "OFF" # Enable infeference optimization with TensorRT.
> > > > > > > USE_ASAN: "OFF" # Enable Clang/GCC ASAN sanitizers.
> > > > > > > ENABLE_TESTCOVERAGE: "OFF" # Enable compilation with test coverage metric output
> > > > > > > CMAKE_BUILD_TYPE: "Release"
> > > > > > > CMAKE_CUDA_COMPILER_LAUNCHER: "ccache"
> > > > > > > CMAKE_C_COMPILER_LAUNCHER: "ccache"
> > > > > > > CMAKE_CXX_COMPILER_LAUNCHER: "ccache"
> > > > > > >
> > > > > > > commit 4d9667121ae6fb643f2a02ab15e25231ed756cde (HEAD, tag: 1.5.0.rc1, upstream/v1.5.x)
> > > > > > > commit 1a7199691f5cbc6012bb53eecbf884bed5ae6590 (HEAD, tag: 1.4.1.rc0, upstream/v1.4.x)
> > > > > > >
> > > > > > > curl http://169.254.169.254/latest/meta-data/instance-type
> > > > > > > c5d.18xlarge
> > > > > > >
> > > > > > >
> > > > > > > Version      : 3.6.7
> > > > > > > Compiler     : GCC 8.2.0
> > > > > > > Build        : ('default', 'Oct 22 2018 11:32:17')
> > > > > > > Arch         : ('64bit', 'ELF')
> > > > > > > ------------Pip Info-----------
> > > > > > > Version      : 19.1.1
> > > > > > > Directory    : /home/piotr/mxnet_1.5/py3_venv/lib/python3.6/site-packages/pip
> > > > > > > ----------MXNet Info-----------
> > > > > > > Version      : 1.5.0
> > > > > > > Directory    : /home/piotr/mxnet_1.5/python/mxnet
> > > > > > > Hashtag not found. Not installed from pre-built package.
> > > > > > > ----------System Info----------
> > > > > > > Platform     : Linux-4.15.0-1035-aws-x86_64-with-Ubuntu-18.04-bionic
> > > > > > > system       : Linux
> > > > > > > node         : ip-172-31-63-171
> > > > > > > release      : 4.15.0-1035-aws
> > > > > > > version      : #37-Ubuntu SMP Mon Mar 18 16:15:14 UTC 2019
> > > > > > > ----------Hardware Info----------
> > > > > > > machine      : x86_64
> > > > > > > processor    : x86_64
> > > > > > > Architecture:        x86_64
> > > > > > > CPU op-mode(s):      32-bit, 64-bit
> > > > > > > Byte Order:          Little Endian
> > > > > > > CPU(s):              72
> > > > > > > On-line CPU(s) list: 0-71
> > > > > > > Thread(s) per core:  2
> > > > > > > Core(s) per socket:  18
> > > > > > > Socket(s):           2
> > > > > > > NUMA node(s):        2
> > > > > > > Vendor ID:           GenuineIntel
> > > > > > > CPU family:          6
> > > > > > > Model:               85
> > > > > > > Model name:          Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
> > > > > > > Stepping:            4
> > > > > > > CPU MHz:             1326.446
> > > > > > > BogoMIPS:            6000.00
> > > > > > > Hypervisor vendor:   KVM
> > > > > > > Virtualization type: full
> > > > > > > L1d cache:           32K
> > > > > > > L1i cache:           32K
> > > > > > > L2 cache:            1024K
> > > > > > > L3 cache:            25344K
> > > > > > > NUMA node0 CPU(s):   0-17,36-53
> > > > > > > NUMA node1 CPU(s):   18-35,54-71
> > > > > > > Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> > > > > > > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall
> > > > > > > nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl
> > > > > > > xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq monitor
> > > > > > > ssse3 fma cx16 pcid
> > > > > > > sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave
> > > > > > > avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch
> > > > > > > invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2
> > > > > > > erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt
> > > > > > > clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves
> > > > > > > ida arat pku ospke
> > > > > > > ----------Network Test----------
> > > > > > >
> > > > > > > ----------Python Info----------
> > > > > > > Version      : 3.6.7
> > > > > > > Compiler     : GCC 8.2.0
> > > > > > > Build        : ('default', 'Oct 22 2018 11:32:17')
> > > > > > > Arch         : ('64bit', 'ELF')
> > > > > > > ------------Pip Info-----------
> > > > > > > Version      : 19.1.1
> > > > > > > Directory    : /home/piotr/mxnet_1.4/py3_venv/lib/python3.6/site-packages/pip
> > > > > > > ----------MXNet Info-----------
> > > > > > > Version      : 1.4.1
> > > > > > > Directory    : /home/piotr/mxnet_1.4/python/mxnet
> > > > > > > Hashtag not found. Not installed from pre-built package.
> > > > > > > ----------System Info----------
> > > > > > > Platform     : Linux-4.15.0-1035-aws-x86_64-with-Ubuntu-18.04-bionic
> > > > > > > system       : Linux
> > > > > > > node         : ip-172-31-63-171
> > > > > > > release      : 4.15.0-1035-aws
> > > > > > > version      : #37-Ubuntu SMP Mon Mar 18 16:15:14 UTC 2019
> > > > > > > ----------Hardware Info----------
> > > > > > > machine      : x86_64
> > > > > > > processor    : x86_64
> > > > > > > Architecture:        x86_64
> > > > > > > CPU op-mode(s):      32-bit, 64-bit
> > > > > > > Byte Order:          Little Endian
> > > > > > > CPU(s):              72
> > > > > > > On-line CPU(s) list: 0-71
> > > > > > > Thread(s) per core:  2
> > > > > > > Core(s) per socket:  18
> > > > > > > Socket(s):           2
> > > > > > > NUMA node(s):        2
> > > > > > > Vendor ID:           GenuineIntel
> > > > > > > CPU family:          6
> > > > > > > Model:               85
> > > > > > > Model name:          Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
> > > > > > > Stepping:            4
> > > > > > > CPU MHz:             1223.344
> > > > > > > BogoMIPS:            6000.00
> > > > > > > Hypervisor vendor:   KVM
> > > > > > > Virtualization type: full
> > > > > > > L1d cache:           32K
> > > > > > > L1i cache:           32K
> > > > > > > L2 cache:            1024K
> > > > > > > L3 cache:            25344K
> > > > > > > NUMA node0 CPU(s):   0-17,36-53
> > > > > > > NUMA node1 CPU(s):   18-35,54-71
> > > > > > > Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> > > > > > > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall
> > > > > > > nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl
> > > > > > > xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq monitor
> > > > > > > ssse3 fma cx16 pcid
> > > > > > > sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave
> > > > > > > avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch
> > > > > > > invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2
> > > > > > > erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt
> > > > > > > clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves
> > > > > > > ida arat pku ospke
> > > > > > > ----------Network Test----------
> > > > > > >
> > > > > > > On Tue, Jun 25, 2019 at 2:35 PM Pedro Larroy
> > > > <pedro.larroy.lists@gmail.com> wrote:
> > > > > > > >
> > > > > > > > I trained cifar10 on CPU and there seems to be a regression
> > > > > > > > of around 7% in training time against 1.4.1:
> > > > > > > >
> > > > > > > > (py3_venv)
> > > > > > > > piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
> > > > > > > > (master)+$ time python cifar10.py --epochs 5
> > > > > > > > real    11m30.388s
> > > > > > > > user    417m7.766s
> > > > > > > > sys     16m57.315s
> > > > > > > >
> > > > > > > > VS 1.4.1:
> > > > > > > > real    10m41.994s
> > > > > > > > user    392m40.646s
> > > > > > > > sys     12m30.601s
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Jun 20, 2019 at 10:15 PM Lai Wei <royweilai@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Anirudh,
> > > > > > > > >
> > > > > > > > > Thanks for jumping into this quickly; I followed up on the
> > > > > > > > > issue.
> > > > > > > > >
> > > > > > > > > I meant for the sockeye developers/maintainers to help set
> > > > > > > > > up nightly tests and raise issues early.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > On Fri, Jun 21, 2019 at 10:10 AM Haibin Lin
> > > > > > > > > <haibin.lin.aws@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > In GluonNLP we are testing with the MXNet nightly build
> > > > > > > > > > for each PR, and the CI did catch some MXNet-related
> > > > > > > > > > issues. I recommend that other toolkits also add
> > > > > > > > > > integration tests with MXNet nightly.
> > > > > > > > > > It helps identify issues early.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Haibin
> > > > > > > > > >
> > > > > > > > > > On Thu, Jun 20, 2019 at 18:52 Zhao, Patric
> > > > > > > > > > <patric.zhao@intel.com>
> > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks for raising the issue; we will take a look ASAP.
> > > > > > > > > > >
> > > > > > > > > > > The downstream cases are not in the MXNet CI, so it's
> > > > > > > > > > > hard for MXNet developers to catch potential bugs or
> > > > > > > > > > > performance degradation.
> > > > > > > > > > >
> > > > > > > > > > > In the future, I suggest adding the major downstream
> > > > > > > > > > > test cases, such as those from sockeye, GluonNLP,
> > > > > > > > > > > GluonCV, DGL, and Gluon-TS, to the nightly tests.
> > > > > > > > > > > If that's still too heavy, maybe test them weekly or
> > > > > > > > > > > monthly :)
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > --Patric
> > > > > > > > > > >
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Anirudh Subramanian
> > > > > > > > > > > > [mailto:anirudh2290@gmail.com]
> > > > > > > > > > > > Sent: Friday, June 21, 2019 9:31 AM
> > > > > > > > > > > > To: dev@mxnet.incubator.apache.org
> > > > > > > > > > > > Cc: dev@mxnet.apache.org
> > > > > > > > > > > > Subject: Re: [VOTE] Release Apache MXNet (incubating)
> > > > > > > > > > > > version
> > > > > > > > > > > > 1.5.0.rc1
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Lai,
> > > > > > > > > > > >
> > > > > > > > > > > > I have opened an issue:
> > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/15297
> > > > > > > > > > > > I came to know about this issue only today and I have
> > > > > > > > > > > > not been monitoring sockeye.
> > > > > > > > > > > > I jumped onto this issue to make sure it wasn't caused
> > > > > > > > > > > > by the dlpack changes.
> > > > > > > > > > > > Also, I don't think sockeye CI checks against master;
> > > > > > > > > > > > it is using 1.4.1.
> > > > > > > > > > > >
> > > > > > > > > > > > Anirudh
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Jun 20, 2019 at 6:17 PM Lai Wei
> > > > > > > > > > > > > <royweilai@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Could you share which test failed and what the
> > > > > > > > > > > > > > crash is? How can it be reproduced?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I was able to install sockeye, and all tests
> > > > > > > > > > > > > > passed using python setup.py test.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I have tested both the nightly pip package and
> > > > > > > > > > > > > > 1.5.0.rc1.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It would be great to create an issue with
> > > > > > > > > > > > > reproducible steps and move the discussion there.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Also, I see the sockeye nightly build[1] has been
> > > > > > > > > > > > > > failing for some time. If it's due to an MXNet
> > > > > > > > > > > > > > change, please raise this early so we can track
> > > > > > > > > > > > > > and solve it in time rather than blocking the
> > > > > > > > > > > > > > release during vote time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1] https://travis-ci.org/awslabs/sockeye
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Jun 21, 2019 at 7:01 AM Anirudh Subramanian
> > > > > > > > > > > > > > <anirudh2290@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I was able to reproduce a crash with the commit
> > > > > > > > > > > > > > 09202f7f261954383aa387144524d38f83f18d06 but not
> > > > > > > > > > > > > > with the commit
> > > > > > > > > > > > > > > a862270beb2d796c1ba311183f7f4a766a18ad6c.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Anirudh
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, Jun 20, 2019 at 3:53 PM Lai Wei
> > > > > > > > > > > > > > > <royweilai@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Przemyslaw,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Is there an issue with more details to track
> > > > > > > > > > > > > > > > the problem?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Jun 21, 2019 at 6:04 AM Przemysław
> > > > > > > > > > > > > > > Trędak <ptrendx@apache.org>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -1
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > There is a crash in the sockeye unit tests
> > > > > > > > > > > > > > > > > (python setup.py test), observed starting
> > > > > > > > > > > > > > > > > with the nightly 1.5 build from 6/13 and
> > > > > > > > > > > > > > > > > still occurring in 1.5rc1. I
> > > > > > > > > > > > > > > > > don't yet have the exact commit that is
> > > > > > > > > > > > > > > > > responsible for it, but it is either
> > > > > > > > > > > > > > > > > a862270beb2d796c1ba311183f7f4a766a18ad6c
> > > > > > > > > > > > > > > > > (dlpack related) or
> > > > > > > > > > > > > > > > > 09202f7f261954383aa387144524d38f83f18d06
> > > > > > > > > > > > > > > > > (cached op optimization).
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On 2019/06/20 06:36:22, Lai Wei
> > > > > > > > > > > > > > > > > <royweilai@gmail.com> wrote:
> > > > > > > > > > > > > > > > > Dear MXNet community,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > This is the 3-day vote to release Apache
> > > > > > > > > > > > > > > > > > MXNet (incubating) version 1.5.0.
> > > > > > > > > > > > > > > > > > Voting on dev@ will start June 19,
> > > > > > > > > > > > > > > > > > 23:59:59(PST) and close on June 22,
> > > > > > > > > > > > > > > > > > 23:59:59.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 1) Link to release notes:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 2) Link to release candidate:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc1
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 3) Link to source and signatures on apache
> > > > > > > > > > > > > > > > > > dist server:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc1/
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Please remember to TEST first before voting
> > > > > > > > > > > > > > > > > > accordingly:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > +1 = approve
> > > > > > > > > > > > > > > > > +0 = no opinion
> > > > > > > > > > > > > > > > > -1 = disapprove (provide reason)
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Best Regards
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Lai
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Best Regards
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Lai
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Best Regards
> > > > > > > > > > > > >
> > > > > > > > > > > > > Lai
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best Regards
> > > > > > > > >
> > > > > > > > > Lai
> > >
> > --
> > Best Regards
> >
> > Lai
>
>

-- 
Sandeep Krishnamurthy
