mxnet-dev mailing list archives

From Lai Wei <roywei...@gmail.com>
Subject Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Date Fri, 28 Jun 2019 16:38:21 GMT
Hi,

Some more data points:

I ran the same cifar10.py script with the same setup, but added a fixed seed.

Ran 50 epochs, treating the first 10 epochs as warmup.
I got the following average time per epoch:
1.4.1: 164.95 s
1.5.0: 170.44 s
Detailed data at [1]
This is about a 3% regression, less than Manu's result but closer to the
Gluon result.
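For reference, the averaging above can be sketched as follows. This is a minimal illustration; the `durations` list stands in for the per-epoch times logged by cifar10.py, and the numbers below are made up, not the measured values:

```python
# Average time per epoch, discarding the first `warmup` epochs.
def avg_epoch_time(durations, warmup=10):
    steady = durations[warmup:]      # drop warmup epochs
    return sum(steady) / len(steady)

# Illustrative numbers: 50 epochs, first 10 treated as warmup.
times = [180.0] * 10 + [165.0] * 40
print(avg_epoch_time(times))         # averages only the last 40 epochs
```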

As for the operator benchmarks from Sandeep [2], I have calculated the
percentage of speedup/regression here [1]. It looks like not all of the
operators mentioned before slowed down. Should it be treated as a separate
issue, since it tests on synthetic data with different shapes than the
CIFAR10 dataset? For example, BatchNorm shows no regression in the report,
but it is slowed down in the cifar10.py script profiling.

[1] https://gist.github.com/roywei/41fce930f013ff3b54cda6e86eaaf66b
[2]
https://gist.github.com/sandeep-krishnamurthy/e0a2be893c8c4d484390c9c8813bdf50
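For anyone reproducing the comparison in [1], the per-operator percentage change can be computed as sketched below. The operator names and timings here are illustrative placeholders, not the actual benchmark numbers from the gists:

```python
# Percent change in per-operator runtime between two releases.
# Positive means slower in the new release (regression), negative means faster.
def pct_change(t_old, t_new):
    return (t_new - t_old) / t_old * 100.0

# Illustrative timings in ms (placeholders, not measured values):
v141 = {"BatchNorm": 2.0, "Convolution": 5.0}
v150 = {"BatchNorm": 2.2, "Convolution": 4.5}

for op in v141:
    print(f"{op}: {pct_change(v141[op], v150[op]):+.1f}%")
```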


On Fri, Jun 28, 2019 at 2:47 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
wrote:

> Thanks Manu.
>
> @all: I observed other strange stuff that I don't understand at the moment:
>
> I installed the 1.5 rc from pip to check that I'm not doing something
> wrong when building. I found that CPU usage is quite
> subpar ( https://imgur.com/fRmbQNc ) compared to a version compiled
> from source: the pip package uses only 4-5 of the 32 cores, while a
> from-source build gets good core utilization (
> https://imgur.com/e8BB425 ). I verified this on a c5d.18xlarge
> and on a 32-core AMD bare metal machine.
>
> It also seems that the pip version is using gomp instead of
> LLVM's OpenMP. I'm not sure why.
>
> pip install mxnet==1.5.0b20190627
> /home/piotr/py3_1.5rc/lib/python3.6/site-packages/mxnet
> piotr@panther:0: ~/p/l/p/s/mxnet> ldd libmxnet.so | grep omp
>     libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> (0x00007f99d1832000)
>
> I tried cifar10 on a bare metal 32-core AMD Zen machine and it is
> extremely slow; it doesn't seem to make much progress compared to a
> c5d.18xlarge. I couldn't even finish 1 epoch, and tried with and without
> MKL without much success. Will continue digging into this when possible.
>
>
> Pedro.
>
> On Thu, Jun 27, 2019 at 9:41 PM Manu Seth <manuseth1010@gmail.com> wrote:
> >
> > Hi all,
> >
> > I ran the same cifar10.py script as Pedro, but for 20 epochs. Considering
> > the first 10 epochs for warm-up, I averaged time per epoch for the last
> 10
> > epochs.
> >
> > With MXNet 1.4.1 average time is 164.23 s
> > With MXNet 1.5.0 average time is 174.59 s (~6.3% regression)
> >
> >
> > For a second data point, I ran Gluon speed test benchmark script -
> >
> https://github.com/apache/incubator-mxnet/blob/master/benchmark/python/gluon/benchmark_gluon.py
> > using the following command:
> > python3 benchmark_gluon.py --model 'resnet152_v2' --batch-size 128
> > --num-batches 200 --type 'training'
> >
> > I got the following speeds:
> > With MXNet 1.4.1, average speed is 25.677534 img/s
> > With MXNet 1.5.0, average speed is 25.082130 img/s (~2.3% regression)
> >
> > Note:
> > For 1.4.1, I used pip install mxnet-mkl==1.4.1
> > For 1.5.0, I used pip install mxnet-mkl==1.5.0b20190619, which
> > corresponds to commit ccbbf6b4b76ea536a6583c99497c83b65a20817b and is
> > 4 commits behind the 1.5.x branch
> >
> >
> > Best,
> > Manu
> >
> >
> > On 6/27/19, 3:37 PM, "sandeep krishnamurthy" <
> sandeep.krishna98@gmail.com>
> > wrote:
> >
> >     Hello Ciyong/Pedro,
> >
> >     Ran operator benchmarks on 1.4.1 and 1.5.0.rc2. (Not complete: it
> >     doesn't cover all MXNet operators and is not presented in the best
> >     possible way; still WIP.)
> >
> >
> https://gist.github.com/sandeep-krishnamurthy/e0a2be893c8c4d484390c9c8813bdf50
> >
> >     The following operators look slower in 1.5 compared to 1.4.1:
> >     - BatchNorm
> >     - Pooling
> >     - FullyConnected
> >     - batch_dot
> >     - Dot
> >     - broadcast_mul
> >     - log_softmax
> >     and a few other operators
> >
> >     Also, several operators run a lot faster on 1.5 compared to 1.4.1,
> >     for example Convolution, flatten, elementwise operators, etc. So it
> >     seems likely that a few operators have regressed noticeably; however,
> >     due to other operator performance improvements, the end effect is not
> >     that significant, which hides a lot of the regression. We need a more
> >     detailed per-operator performance analysis. We will not be able to do
> >     this for the current release; we should have a more concrete way of
> >     determining such performance regressions before the next release.
> >
> >     Setup:
> >     1.5 => Build from source (head of 1.5.rc2 tag), built with MKLDNN
> >     1.4.1 => PyPi mxnet-mkl==1.4.1
> >     Machine: C5.18X
> >     No explicit environment variables were set
> >     Operator benchmark code -
> >
> https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf
> >
> >     Best,
> >     Sandeep
> >
> >
> >     On Thu, Jun 27, 2019 at 10:42 AM Pedro Larroy <
> > pedro.larroy.lists@gmail.com>
> >     wrote:
> >
> >     > I will try to run a few benchmarks in a bare metal instance
> tonight to
> >     > remove virtualization variance for the measurements and provide
> some
> >     > numbers.
> >     >
> >     > Please propose a set of models / examples that would be desirable
> to
> >     > run before the release and provide a link to an easy to run script
> >     > with instructions so we can validate the release better.
> >     >
> >     > Thank you.
> >     >
> >     > On Thu, Jun 27, 2019 at 10:01 AM Lai Wei <royweilai@gmail.com>
> wrote:
> >     > >
> >     > > Dear @dev,
> >     > >
> >     > > I'm cancelling the vote for the cached op fix:
> >     > >
> >     > > https://github.com/apache/incubator-mxnet/pull/15298
> >     > >
> >     > > As for the possible CPU training regression, it looks like it is
> >     > > not a blocker for now.
> >     > >
> >     > > I will start a new rc2 vote; please help validate it.
> >     > >
> >     > > Thanks!
> >     > >
> >     > >
> >     > > On Thu, Jun 27, 2019 at 10:06 PM Chen, Ciyong <
> ciyong.chen@intel.com
> > >
> >     > wrote:
> >     > >
> >     > > > Hi Pedro,
> >     > > >
> >     > > > I was able to reproduce a similar result (v1.5 is ~5.6% slower
> >     > > > than v1.4; I was using 18 cores for computing) with your script
> >     > > > on C5.18xlarge. But I needed to bind the cores with the commands
> >     > > > below when running the script (without setting these env
> >     > > > variables, I got close times (<1% difference) with v1.5 and v1.4):
> >     > > >         export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
> >     > > >         export OMP_NUM_THREADS=18
> >     > > >
> >     > > > Did you set any env variables during running?
> >     > > >
> >     > > > The performance result I got as below:
> >     > > > 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> >     > > > real    12m10.856s
> >     > > > user    234m49.576s
> >     > > > sys     4m38.044s
> >     > > >
> >     > > > 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> >     > > > real    12m52.140s
> >     > > > user    246m30.740s
> >     > > > sys     5m8.188s
> >     > > >
> >     > > > Looking at the profiling data, most of the ops have the same
> >     > > > perf between v1.4 and v1.5. But some ops like "_backward_BatchNorm"
> >     > > > and "Pooling" are ~1.37x slower on v1.5 compared with v1.4.
> >     > > > Will do further analysis on these ops.
> >     > > >
> >     > > > Here's the hardware/OS info from my side:
> >     > > > ----------Python Info----------
> >     > > > Version      : 3.6.8
> >     > > > Compiler     : GCC 7.3.0
> >     > > > Build        : ('default', 'Dec 30 2018 01:22:34')
> >     > > > Arch         : ('64bit', '')
> >     > > > ------------Pip Info-----------
> >     > > > Version      : 19.0.3
> >     > > > Directory    :
> >     > > >
> > /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> >     > > > ----------MXNet Info-----------
> >     > > > Version      : 1.5.0
> >     > > > Directory    : /home/ubuntu/ws/incubator-mxnet/python/mxnet
> >     > > > Hashtag not found. Not installed from pre-built package.
> >     > > > ----------System Info----------
> >     > > > Platform     :
> Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> >     > > > system       : Linux
> >     > > > node         : ip-172-31-32-129
> >     > > > release      : 4.4.0-1085-aws
> >     > > > version      : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
> >     > > > ----------Hardware Info----------
> >     > > > machine      : x86_64
> >     > > > processor    : x86_64
> >     > > > Architecture:          x86_64
> >     > > > CPU op-mode(s):        32-bit, 64-bit
> >     > > > Byte Order:            Little Endian
> >     > > > CPU(s):                72
> >     > > > On-line CPU(s) list:   0-71
> >     > > > Thread(s) per core:    2
> >     > > > Core(s) per socket:    18
> >     > > > Socket(s):             2
> >     > > > NUMA node(s):          2
> >     > > > Vendor ID:             GenuineIntel
> >     > > > CPU family:            6
> >     > > > Model:                 85
> >     > > > Model name:            Intel(R) Xeon(R) Platinum 8124M CPU @
> > 3.00GHz
> >     > > > Stepping:              3
> >     > > > CPU MHz:               3000.000
> >     > > > BogoMIPS:              6000.00
> >     > > > Hypervisor vendor:     KVM
> >     > > > Virtualization type:   full
> >     > > > L1d cache:             32K
> >     > > > L1i cache:             32K
> >     > > > L2 cache:              1024K
> >     > > > L3 cache:              25344K
> >     > > > NUMA node0 CPU(s):     0-17,36-53
> >     > > > NUMA node1 CPU(s):     18-35,54-71
> >     > > > Flags:                 fpu vme de pse tsc msr pae mce cx8 apic
> > sep mtrr
> >     > > > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall
> nx
> >     > pdpe1gb
> >     > > > rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology
> > nonstop_tsc
> >     > > > aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16
> > pcid
> >     > sse4_1
> >     > > > sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx
> f16c
> > rdrand
> >     > > > hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser
> > fsgsbase
> >     > > > tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f
> > rdseed
> >     > adx
> >     > > > smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat
> pku
> >     > > > ----------Network Test----------
> >     > > >
> >     > > >
> >     > > > -Ciyong
> >     > > >
> >     > > >
> >     > > > -----Original Message-----
> >     > > > From: Zhao, Patric [mailto:patric.zhao@intel.com]
> >     > > > Sent: Thursday, June 27, 2019 9:55 AM
> >     > > > To: dev@mxnet.incubator.apache.org
> >     > > > Cc: dev@mxnet.apache.org
> >     > > > Subject: RE: [VOTE] Release Apache MXNet (incubating) version
> > 1.5.0.rc1
> >     > > >
> >     > > > Could we run more epochs to see the performance difference or
> > profiling
> >     > > > the difference between good and bad run?
> >     > > >
> >     > > > > -----Original Message-----
> >     > > > > From: Pedro Larroy [mailto:pedro.larroy.lists@gmail.com]
> >     > > > > Sent: Thursday, June 27, 2019 9:35 AM
> >     > > > > To: dev@mxnet.incubator.apache.org
> >     > > > > Cc: dev@mxnet.apache.org
> >     > > > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version
> >     > > > > 1.5.0.rc1
> >     > > > >
> >     > > > > I ran it again and the gap is bigger again; I guess we need to
> >     > > > > average the times across several runs:
> >     > > > >
> >     > > > > piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
> >     > > > > (master)+$ time ~/mxnet_1.4/py3_venv/bin/python cifar10.py
> > --epochs 5
> >     > > > > && time ~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
> >     > > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:172:
> >     > > > > ImageRecordIOParser2:
> >     > > > > /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use
> 4
> >     > threads
> >     > > > > for decoding..
> >     > > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean
> > image
> >     > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> >     > > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean
> > image
> >     > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > completed
> >     > > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:172:
> >     > > > > ImageRecordIOParser2:
> >     > > > > /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4
> > threads
> >     > > > > for decoding..
> >     > > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean
> > image
> >     > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> >     > > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean
> > image
> >     > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > completed
> >     > > > > lr_schedule: {0: 0.05, 82: 0.005000000000000001, 123: 0.0005,
> > 300:
> >     > > > > 0.0001} Epoch 0, Changed learning rate to 0.05 [23:17:09]
> >     > > > > ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> >     > > > > 147456 bytes with malloc directly
> >     > > > > [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74:
> Allocate
> >     > > > > 589824 bytes with malloc directly
> >     > > > > [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74:
> Allocate
> >     > > > > 2359296 bytes with malloc directly
> >     > > > > [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74:
> Allocate
> >     > > > > 9437184 bytes with malloc directly
> >     > > > > Epoch 0, Batch 199, Speed=384.149839
> >     > > > > Epoch 0, Duration=140.919567
> >     > > > > Epoch 0, Training accuracy=0.115169
> >     > > > > Epoch 0, Validation accuracy=0.141317
> >     > > > > Epoch 1, Batch 199, Speed=433.380512
> >     > > > > Epoch 1, Duration=119.553233
> >     > > > > Epoch 1, Training accuracy=0.170956
> >     > > > > Epoch 1, Validation accuracy=0.216146
> >     > > > > Epoch 2, Batch 199, Speed=434.864699
> >     > > > > Epoch 2, Duration=123.278490
> >     > > > > Epoch 2, Training accuracy=0.209455
> >     > > > > Epoch 2, Validation accuracy=0.247296
> >     > > > > Epoch 3, Batch 199, Speed=433.401854
> >     > > > > Epoch 3, Duration=118.327797
> >     > > > > Epoch 3, Training accuracy=0.248701
> >     > > > > Epoch 3, Validation accuracy=0.302083
> >     > > > > Epoch 4, Batch 199, Speed=419.713707
> >     > > > > Epoch 4, Duration=126.468409
> >     > > > > Epoch 4, Training accuracy=0.260949
> >     > > > > Epoch 4, Validation accuracy=0.269030
> >     > > > >
> >     > > > > real    10m55.796s
> >     > > > > user    399m33.567s
> >     > > > > sys     13m55.904s
> >     > > > > [23:28:04] ../src/io/iter_image_recordio_2.cc:172:
> >     > > > > ImageRecordIOParser2:
> >     > > > > /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use
> 4
> >     > threads
> >     > > > > for decoding..
> >     > > > > [23:28:04] ../src/io/iter_image_recordio_2.cc:230: Load mean
> > image
> >     > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> >     > > > > [23:28:04] ../src/io/iter_image_recordio_2.cc:248: Load mean
> > image
> >     > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > completed
> >     > > > > [23:28:04] ../src/io/iter_image_recordio_2.cc:172:
> >     > > > > ImageRecordIOParser2:
> >     > > > > /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4
> > threads
> >     > > > > for decoding..
> >     > > > > [23:28:04] ../src/io/iter_image_recordio_2.cc:230: Load mean
> > image
> >     > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> >     > > > > [23:28:04] ../src/io/iter_image_recordio_2.cc:248: Load mean
> > image
> >     > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > completed
> >     > > > > lr_schedule: {0: 0.05, 82: 0.005000000000000001, 123: 0.0005,
> > 300:
> >     > > > > 0.0001} Epoch 0, Changed learning rate to 0.05 Epoch 0, Batch
> > 199,
> >     > > > > Speed=419.039188 Epoch 0, Duration=143.934903 Epoch 0,
> Training
> >     > > > > accuracy=0.122542 Epoch 0, Validation accuracy=0.164359
> Epoch 1,
> >     > Batch
> >     > > > > 199, Speed=445.257048 Epoch 1, Duration=135.248399 Epoch 1,
> > Training
> >     > > > > accuracy=0.178828 Epoch 1, Validation accuracy=0.199419
> Epoch 2,
> >     > Batch
> >     > > > > 199, Speed=447.115215 Epoch 2, Duration=132.003770 Epoch 2,
> > Training
> >     > > > > accuracy=0.217808 Epoch 2, Validation accuracy=0.233073
> Epoch 3,
> >     > Batch
> >     > > > > 199, Speed=441.079477 Epoch 3, Duration=126.543316 Epoch 3,
> > Training
> >     > > > > accuracy=0.248102 Epoch 3, Validation accuracy=0.293870
> Epoch 4,
> >     > Batch
> >     > > > > 199, Speed=449.329787 Epoch 4, Duration=138.398325 Epoch 4,
> > Training
> >     > > > > accuracy=0.270021 Epoch 4, Validation accuracy=0.311498
> >     > > > >
> >     > > > > real    11m45.329s
> >     > > > > user    426m13.908s
> >     > > > > sys     16m45.093s
> >     > > > >
> >     > > > > On Wed, Jun 26, 2019 at 4:18 PM Pedro Larroy
> >     > > > > <pedro.larroy.lists@gmail.com> wrote:
> >     > > > > >
> >     > > > > > The difference looks smaller now, more like your numbers. I
> >     > > > > > wonder if something happened during the previous benchmark,
> >     > > > > > like a system update...
> >     > > > > >
> >     > > > > >
> >     > > > > > piotr@ip-172-31-63-171
> :0:~/deeplearning-benchmark/dawnbench
> >     > > > > (master)+$
> >     > > > > > time ~/mxnet_1.4/py3_venv/bin/python cifar10.py --epochs 5
> &&
> > time
> >     > > > > > ~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
> > [22:49:41]
> >     > > > > > ../src/io/iter_image_recordio_2.cc:172:
> >     > > > > > ImageRecordIOParser2:
> >     > > > > > /home/piotr/deeplearning-benchmark/data/cifar/train.rec,
> use 4
> >     > > > > > threads for decoding..
> >     > > > > > [22:49:41] ../src/io/iter_image_recordio_2.cc:230: Load
> mean
> > image
> >     > > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> >     > > > > > [22:49:41] ../src/io/iter_image_recordio_2.cc:248: Load
> mean
> > image
> >     > > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> >     > > > > completed
> >     > > > > > [22:49:41] ../src/io/iter_image_recordio_2.cc:172:
> >     > > > > > ImageRecordIOParser2:
> >     > > > > > /home/piotr/deeplearning-benchmark/data/cifar/test.rec,
> use 4
> >     > > > > > threads for decoding..
> >     > > > > > [22:49:41] ../src/io/iter_image_recordio_2.cc:230: Load
> mean
> > image
> >     > > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> >     > > > > > [22:49:41] ../src/io/iter_image_recordio_2.cc:248: Load
> mean
> > image
> >     > > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> >     > > > > completed
> >     > > > > > lr_schedule: {0: 0.05, 82: 0.005000000000000001, 123:
> 0.0005,
> > 300:
> >     > > > > > 0.0001} Epoch 0, Changed learning rate to 0.05 [22:49:42]
> >     > > > > > ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> >     > > > > > 147456 bytes with malloc directly
> >     > > > > > [22:49:42] ../src/operator/nn/mkldnn/mkldnn_base.cc:74:
> > Allocate
> >     > > > > > 589824 bytes with malloc directly
> >     > > > > > [22:49:42] ../src/operator/nn/mkldnn/mkldnn_base.cc:74:
> > Allocate
> >     > > > > > 2359296 bytes with malloc directly
> >     > > > > > [22:49:42] ../src/operator/nn/mkldnn/mkldnn_base.cc:74:
> > Allocate
> >     > > > > > 9437184 bytes with malloc directly
> >     > > > > > Epoch 0, Batch 199, Speed=426.182733 Epoch 0,
> > Duration=134.868458
> >     > > > > > Epoch 0, Training accuracy=0.127238 Epoch 0, Validation
> >     > > > > > accuracy=0.206388 Epoch 1, Batch 199, Speed=313.127156
> Epoch
> > 1,
> >     > > > > > Duration=128.041775 Epoch 1, Training accuracy=0.182065
> Epoch
> > 1,
> >     > > > > > Validation accuracy=0.202524 Epoch 2, Batch 199,
> > Speed=410.931187
> >     > > > > > Epoch 2, Duration=124.920588 Epoch 2, Training
> > accuracy=0.202584
> >     > > > > > Epoch 2, Validation accuracy=0.245693 Epoch 3, Batch 199,
> >     > > > > > Speed=419.119335 Epoch 3, Duration=120.948349 Epoch 3,
> > Training
> >     > > > > > accuracy=0.235854 Epoch 3, Validation accuracy=0.291066
> Epoch
> > 4,
> >     > > > > > Batch 199, Speed=430.473733 Epoch 4, Duration=130.181724
> > Epoch 4,
> >     > > > > > Training accuracy=0.257773 Epoch 4, Validation
> > accuracy=0.304988
> >     > > > > >
> >     > > > > > real    11m7.356s
> >     > > > > > user    406m9.910s
> >     > > > > > sys     14m18.349s
> >     > > > > > [23:00:49] ../src/io/iter_image_recordio_2.cc:172:
> >     > > > > > ImageRecordIOParser2:
> >     > > > > > /home/piotr/deeplearning-benchmark/data/cifar/train.rec,
> use 4
> >     > > > > > threads for decoding..
> >     > > > > > [23:00:49] ../src/io/iter_image_recordio_2.cc:230: Load
> mean
> > image
> >     > > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> >     > > > > > [23:00:49] ../src/io/iter_image_recordio_2.cc:248: Load
> mean
> > image
> >     > > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> >     > > > > completed
> >     > > > > > [23:00:49] ../src/io/iter_image_recordio_2.cc:172:
> >     > > > > > ImageRecordIOParser2:
> >     > > > > > /home/piotr/deeplearning-benchmark/data/cifar/test.rec,
> use 4
> >     > > > > > threads for decoding..
> >     > > > > > [23:00:49] ../src/io/iter_image_recordio_2.cc:230: Load
> mean
> > image
> >     > > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> >     > > > > > [23:00:49] ../src/io/iter_image_recordio_2.cc:248: Load
> mean
> > image
> >     > > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> >     > > > > completed
> >     > > > > > lr_schedule: {0: 0.05, 82: 0.005000000000000001, 123:
> 0.0005,
> > 300:
> >     > > > > > 0.0001} Epoch 0, Changed learning rate to 0.05 Epoch 0,
> Batch
> > 199,
> >     > > > > > Speed=348.618154 Epoch 0, Duration=146.469352 Epoch 0,
> > Training
> >     > > > > > accuracy=0.124121 Epoch 0, Validation accuracy=0.167227
> Epoch
> > 1,
> >     > > > > > Batch 199, Speed=452.790825 Epoch 1, Duration=130.199421
> > Epoch 1,
> >     > > > > > Training
> >     > > > > > accuracy=0.183863 Epoch 1, Validation accuracy=0.237079
> Epoch
> > 2,
> >     > > > > > Batch 199, Speed=451.406559 Epoch 2, Duration=126.320823
> > Epoch 2,
> >     > > > > > Training
> >     > > > > > accuracy=0.214844 Epoch 2, Validation accuracy=0.244692
> Epoch
> > 3,
> >     > > > > > Batch 199, Speed=403.161873 Epoch 3, Duration=125.331660
> > Epoch 3,
> >     > > > > > Training
> >     > > > > > accuracy=0.243506 Epoch 3, Validation accuracy=0.301182
> Epoch
> > 4,
> >     > > > > > Batch 199, Speed=450.826598 Epoch 4, Duration=126.426253
> > Epoch 4,
> >     > > > > > Training
> >     > > > > > accuracy=0.266424 Epoch 4, Validation accuracy=0.311899
> >     > > > > >
> >     > > > > > real    11m21.930s
> >     > > > > > user    415m3.855s
> >     > > > > > sys     13m53.975s
> >     > > > > >
> >     > > > > > On Wed, Jun 26, 2019 at 3:50 PM Pedro Larroy
> >     > > > > > <pedro.larroy.lists@gmail.com> wrote:
> >     > > > > > >
> >     > > > > > > Hi Ciyong, thanks for trying to reproduce:
> >     > > > > > >
> >     > > > > > > I used this one:
> >     > > > > > > https://github.com/awslabs/deeplearning-benchmark/blob/master/dawnbench/cifar10.py
> >     > > > > > >
> >     > > > > > > Could you provide hardware and OS details?
> >     > > > > > >
> >     > > > > > > I will rerun and repost numbers in a few minutes.
> >     > > > > > >
> >     > > > > > > Pedro.
> >     > > > > > >
> >     > > > > > > On Wed, Jun 26, 2019 at 4:18 AM Chen, Ciyong
> >     > > > > > > <ciyong.chen@intel.com>
> >     > > > > wrote:
> >     > > > > > > >
> >     > > > > > > > Hi Pedro,
> >     > > > > > > >
> >     > > > > > > > I'm looking at this case, and using the script
> >     > > > > > > > "incubator-mxnet/example/image-classification/train_cifar10.py"
> >     > > > > > > > to get the timing data, but it seems there's not much
> >     > > > > > > > difference between mxnet 1.4.1.rc0 and 1.5.0.rc1 on
> >     > > > > > > > C5.18xlarge.
> >     > > > > > > >
> >     > > > > > > > Not sure if there's any difference in the python script;
> >     > > > > > > > can you point me to the link to get your script
> >     > > > > > > > (cifar10.py)?
> >     > > > > > > > Or you could try MXNet's script (train_cifar10.py) and
> >     > > > > > > > see the performance.
> >     > > > > > > >
> >     > > > > > > > Here's the command I used to collect the time:
> >     > > > > > > >         python train_cifar10.py --num-epoch=5
> >     > > > > > > >
> >     > > > > > > > 1) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> >     > > > > > > >         real    9m4.880s
> >     > > > > > > >         user    333m13.340s
> >     > > > > > > >         sys     14m36.100s
> >     > > > > > > >
> >     > > > > > > > 2) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> >     > > > > > > >         real    9m2.155s
> >     > > > > > > >         user    329m37.092s
> >     > > > > > > >         sys     16m8.668s
> >     > > > > > > >
> >     > > > > > > > -Ciyong
> >     > > > > > > >
> >     > > > > > > >
> >     > > > > > > > -----Original Message-----
> >     > > > > > > > From: Pedro Larroy [mailto:
> pedro.larroy.lists@gmail.com]
> >     > > > > > > > Sent: Wednesday, June 26, 2019 6:28 AM
> >     > > > > > > > To: dev@mxnet.incubator.apache.org
> >     > > > > > > > Cc: dev@mxnet.apache.org
> >     > > > > > > > Subject: Re: [VOTE] Release Apache MXNet (incubating)
> > version
> >     > > > > > > > 1.5.0.rc1
> >     > > > > > > >
> >     > > > > > > > Hi these were my build flags and system info:
> >     > > > > > > >
> >     > > > > > > >
> >     > > > > > > > --- # CMake configuration
> >     > > > > > > > USE_CUDA: "OFF" # Build with CUDA support
> >     > > > > > > > USE_OLDCMAKECUDA: "OFF" # Build with old cmake cuda
> >     > > > > > > > USE_NCCL: "OFF" # Use NVidia NCCL with CUDA
> >     > > > > > > > USE_OPENCV: "ON" # Build with OpenCV support
> >     > > > > > > > USE_OPENMP: "ON" # Build with Openmp support
> >     > > > > > > > USE_CUDNN: "ON" # Build with cudnn support) # one could
> > set
> >     > > > > > > > CUDNN_ROOT for search path
> >     > > > > > > > USE_SSE: "ON" # Build with x86 SSE instruction support
> IF
> > NOT
> >     > > > > > > > ARM
> >     > > > > > > > USE_F16C: "ON" # Build with x86 F16C instruction
> support)
> > #
> >     > > > > autodetects support if "ON"
> >     > > > > > > > USE_LAPACK: "ON" # Build with lapack support
> >     > > > > > > > USE_MKL_IF_AVAILABLE: "ON" # Use MKL if found
> >     > > > > > > > USE_MKLML_MKL: "ON" # Use MKLDNN variant of MKL (if MKL
> > found)
> >     > > > > > > > IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> >     > > > > > > > USE_MKLDNN: "ON" # Use MKLDNN variant of MKL (if MKL
> > found) IF
> >     > > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> >     > > > > > > > USE_OPERATOR_TUNING: "ON" # Enable auto-tuning of
> > operators IF
> >     > > > > NOT
> >     > > > > > > > MSVC
> >     > > > > > > > USE_GPERFTOOLS: "ON" # Build with GPerfTools support
> (if
> > found)
> >     > > > > > > > USE_JEMALLOC: "ON" # Build with Jemalloc support
> >     > > > > > > > USE_PROFILER: "ON" # Build with Profiler support
> >     > > > > > > > USE_DIST_KVSTORE: "OFF" # Build with DIST_KVSTORE
> support
> >     > > > > > > > USE_PLUGINS_WARPCTC: "OFF" # Use WARPCTC Plugins
> >     > > > > > > > USE_PLUGIN_CAFFE: "OFF" # Use Caffe Plugin
> >     > > > > > > > USE_CPP_PACKAGE: "OFF" # Build C++ Package
> >     > > > > > > > USE_MXNET_LIB_NAMING: "ON" # Use MXNet library naming
> >     > > > > conventions.
> >     > > > > > > > USE_GPROF: "OFF" # Compile with gprof (profiling) flag
> >     > > > > > > > USE_CXX14_IF_AVAILABLE: "OFF" # Build with C++14 if the
> >     > compiler
> >     > > > > > > > supports it
> >     > > > > > > > USE_VTUNE: "OFF" # Enable use of Intel Amplifier XE
> > (VTune)) #
> >     > > > > > > > one could set VTUNE_ROOT for search path
> >     > > > > > > > ENABLE_CUDA_RTC: "ON" # Build with CUDA runtime
> > compilation
> >     > > > > > > > support
> >     > > > > > > > BUILD_CPP_EXAMPLES: "ON" # Build cpp examples
> >     > > > > > > > INSTALL_EXAMPLES: "OFF" # Install the example source
> > files.
> >     > > > > > > > USE_SIGNAL_HANDLER: "ON" # Print stack traces on
> > segfaults.
> >     > > > > > > > USE_TENSORRT: "OFF" # Enable inference optimization with TensorRT.
> >     > > > > > > > USE_ASAN: "OFF" # Enable Clang/GCC ASAN sanitizers.
> >     > > > > > > > ENABLE_TESTCOVERAGE: "OFF" # Enable compilation with
> test
> >     > > > > > > > coverage metric output
> >     > > > > > > > CMAKE_BUILD_TYPE: "Release"
> >     > > > > > > > CMAKE_CUDA_COMPILER_LAUNCHER: "ccache"
> >     > > > > > > > CMAKE_C_COMPILER_LAUNCHER: "ccache"
> >     > > > > > > > CMAKE_CXX_COMPILER_LAUNCHER: "ccache"
> >     > > > > > > >
> >     > > > > > > > commit 4d9667121ae6fb643f2a02ab15e25231ed756cde (HEAD,
> > tag:
> >     > > > > > > > 1.5.0.rc1,
> >     > > > > > > > upstream/v1.5.x)
> >     > > > > > > > commit 1a7199691f5cbc6012bb53eecbf884bed5ae6590 (HEAD,
> > tag:
> >     > > > > > > > 1.4.1.rc0,
> >     > > > > > > > upstream/v1.4.x)
> >     > > > > > > >
> >     > > > > > > > curl
> http://169.254.169.254/latest/meta-data/instance-type
> >     > > > > > > > c5d.18xlarge
> >     > > > > > > >
> >     > > > > > > >
> >     > > > > > > > Version      : 3.6.7
> >     > > > > > > > Compiler     : GCC 8.2.0
> >     > > > > > > > Build        : ('default', 'Oct 22 2018 11:32:17')
> >     > > > > > > > Arch         : ('64bit', 'ELF')
> >     > > > > > > > ------------Pip Info-----------
> >     > > > > > > > Version      : 19.1.1
> >     > > > > > > > Directory    :
> >     > /home/piotr/mxnet_1.5/py3_venv/lib/python3.6/site-
> >     > > > > packages/pip
> >     > > > > > > > ----------MXNet Info-----------
> >     > > > > > > > Version      : 1.5.0
> >     > > > > > > > Directory    : /home/piotr/mxnet_1.5/python/mxnet
> >     > > > > > > > Hashtag not found. Not installed from pre-built
> package.
> >     > > > > > > > ----------System Info----------
> >     > > > > > > > Platform     :
> >     > > > Linux-4.15.0-1035-aws-x86_64-with-Ubuntu-18.04-bionic
> >     > > > > > > > system       : Linux
> >     > > > > > > > node         : ip-172-31-63-171
> >     > > > > > > > release      : 4.15.0-1035-aws
> >     > > > > > > > version      : #37-Ubuntu SMP Mon Mar 18 16:15:14 UTC
> 2019
> >     > > > > > > > ----------Hardware Info----------
> >     > > > > > > > machine      : x86_64
> >     > > > > > > > processor    : x86_64
> >     > > > > > > > Architecture:        x86_64
> >     > > > > > > > CPU op-mode(s):      32-bit, 64-bit
> >     > > > > > > > Byte Order:          Little Endian
> >     > > > > > > > CPU(s):              72
> >     > > > > > > > On-line CPU(s) list: 0-71
> >     > > > > > > > Thread(s) per core:  2
> >     > > > > > > > Core(s) per socket:  18
> >     > > > > > > > Socket(s):           2
> >     > > > > > > > NUMA node(s):        2
> >     > > > > > > > Vendor ID:           GenuineIntel
> >     > > > > > > > CPU family:          6
> >     > > > > > > > Model:               85
> >     > > > > > > > Model name:          Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
> >     > > > > > > > Stepping:            4
> >     > > > > > > > CPU MHz:             1326.446
> >     > > > > > > > BogoMIPS:            6000.00
> >     > > > > > > > Hypervisor vendor:   KVM
> >     > > > > > > > Virtualization type: full
> >     > > > > > > > L1d cache:           32K
> >     > > > > > > > L1i cache:           32K
> >     > > > > > > > L2 cache:            1024K
> >     > > > > > > > L3 cache:            25344K
> >     > > > > > > > NUMA node0 CPU(s):   0-17,36-53
> >     > > > > > > > NUMA node1 CPU(s):   18-35,54-71
> >     > > > > > > > Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
> >     > > > > > > > ----------Network Test----------
> >     > > > > > > >
> >     > > > > > > > ----------Python Info----------
> >     > > > > > > > Version      : 3.6.7
> >     > > > > > > > Compiler     : GCC 8.2.0
> >     > > > > > > > Build        : ('default', 'Oct 22 2018 11:32:17')
> >     > > > > > > > Arch         : ('64bit', 'ELF')
> >     > > > > > > > ------------Pip Info-----------
> >     > > > > > > > Version      : 19.1.1
> >     > > > > > > > Directory    : /home/piotr/mxnet_1.4/py3_venv/lib/python3.6/site-packages/pip
> >     > > > > > > > ----------MXNet Info-----------
> >     > > > > > > > Version      : 1.4.1
> >     > > > > > > > Directory    : /home/piotr/mxnet_1.4/python/mxnet
> >     > > > > > > > Hashtag not found. Not installed from pre-built package.
> >     > > > > > > > ----------System Info----------
> >     > > > > > > > Platform     : Linux-4.15.0-1035-aws-x86_64-with-Ubuntu-18.04-bionic
> >     > > > > > > > system       : Linux
> >     > > > > > > > node         : ip-172-31-63-171
> >     > > > > > > > release      : 4.15.0-1035-aws
> >     > > > > > > > version      : #37-Ubuntu SMP Mon Mar 18 16:15:14 UTC 2019
> >     > > > > > > > ----------Hardware Info----------
> >     > > > > > > > machine      : x86_64
> >     > > > > > > > processor    : x86_64
> >     > > > > > > > Architecture:        x86_64
> >     > > > > > > > CPU op-mode(s):      32-bit, 64-bit
> >     > > > > > > > Byte Order:          Little Endian
> >     > > > > > > > CPU(s):              72
> >     > > > > > > > On-line CPU(s) list: 0-71
> >     > > > > > > > Thread(s) per core:  2
> >     > > > > > > > Core(s) per socket:  18
> >     > > > > > > > Socket(s):           2
> >     > > > > > > > NUMA node(s):        2
> >     > > > > > > > Vendor ID:           GenuineIntel
> >     > > > > > > > CPU family:          6
> >     > > > > > > > Model:               85
> >     > > > > > > > Model name:          Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
> >     > > > > > > > Stepping:            4
> >     > > > > > > > CPU MHz:             1223.344
> >     > > > > > > > BogoMIPS:            6000.00
> >     > > > > > > > Hypervisor vendor:   KVM
> >     > > > > > > > Virtualization type: full
> >     > > > > > > > L1d cache:           32K
> >     > > > > > > > L1i cache:           32K
> >     > > > > > > > L2 cache:            1024K
> >     > > > > > > > L3 cache:            25344K
> >     > > > > > > > NUMA node0 CPU(s):   0-17,36-53
> >     > > > > > > > NUMA node1 CPU(s):   18-35,54-71
> >     > > > > > > > Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
> >     > > > > > > > ----------Network Test----------
> >     > > > > > > >
> >     > > > > > > > On Tue, Jun 25, 2019 at 2:35 PM Pedro Larroy
> >     > > > > <pedro.larroy.lists@gmail.com> wrote:
> >     > > > > > > > >
> >     > > > > > > > > I did a training run of cifar10 on CPU, and there
> >     > > > > > > > > seems to be a regression of around 7% in training
> >     > > > > > > > > time against 1.4.1:
> >     > > > > > > > >
> >     > > > > > > > > (py3_venv) piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
> >     > > > > > > > > (master)+$ time python cifar10.py --epochs 5
> >     > > > > > > > > real    11m30.388s
> >     > > > > > > > > user    417m7.766s
> >     > > > > > > > > sys     16m57.315s
> >     > > > > > > > >
> >     > > > > > > > > VS 1.4.1:
> >     > > > > > > > > real    10m41.994s
> >     > > > > > > > > user    392m40.646s
> >     > > > > > > > > sys     12m30.601s
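[Editor's note] The regression percentage quoted above can be checked directly from the `real` wall-clock times in the two `time` outputs; a minimal sketch in Python, with the timings copied from this message:

```python
# Compute the training-time regression from the `time` output quoted above.
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert a minutes/seconds pair from `time` output into seconds."""
    return minutes * 60 + seconds

t_150 = to_seconds(11, 30.388)  # 1.5.0.rc1: real 11m30.388s
t_141 = to_seconds(10, 41.994)  # 1.4.1:     real 10m41.994s

regression_pct = (t_150 - t_141) / t_141 * 100
print(f"wall-clock regression: {regression_pct:.1f}%")  # about 7.5%
```

This comes out to roughly 7.5%, consistent with the "around 7%" estimate above.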
> >     > > > > > > > >
> >     > > > > > > > >
> >     > > > > > > > > On Thu, Jun 20, 2019 at 10:15 PM Lai Wei <
> >     > royweilai@gmail.com>
> >     > > > > wrote:
> >     > > > > > > > > >
> >     > > > > > > > > > Hi Anirudh,
> >     > > > > > > > > >
> >     > > > > > > > > > Thanks for jumping on this quickly; I followed up
> >     > > > > > > > > > on the issue.
> >     > > > > > > > > >
> >     > > > > > > > > > It was meant for sockeye developers/maintainers to
> >     > > > > > > > > > help set up nightly tests and raise issues early.
> >     > > > > > > > > >
> >     > > > > > > > > > Thanks!
> >     > > > > > > > > >
> >     > > > > > > > > > On Fri, Jun 21, 2019 at 10:10 AM Haibin Lin
> >     > > > > > > > > > <haibin.lin.aws@gmail.com>
> >     > > > > > > > > > wrote:
> >     > > > > > > > > >
> >     > > > > > > > > > > In GluonNLP we are testing with the MXNet nightly
> >     > > > > > > > > > > build for each PR, and we did find some
> >     > > > > > > > > > > MXNet-related issues caught by the CI.
> >     > > > > > > > > > > I recommend other toolkits also add integration
> > tests
> >     > with
> >     > > > > > > > > > > MXNet
> >     > > > > nightly.
> >     > > > > > > > > > > It helps identify issues early.
> >     > > > > > > > > > >
> >     > > > > > > > > > > Best,
> >     > > > > > > > > > > Haibin
> >     > > > > > > > > > >
> >     > > > > > > > > > > On Thu, Jun 20, 2019 at 18:52 Zhao, Patric
> >     > > > > > > > > > > <patric.zhao@intel.com>
> >     > > > > wrote:
> >     > > > > > > > > > >
> >     > > > > > > > > > > > Thanks for raising the issue; we will take a
> >     > > > > > > > > > > > look ASAP.
> >     > > > > > > > > > > >
> >     > > > > > > > > > > > The downstream cases are not in the MXNet CI,
> >     > > > > > > > > > > > so it's hard for MXNet developers to catch
> >     > > > > > > > > > > > potential bugs or performance degradations.
> >     > > > > > > > > > > >
> >     > > > > > > > > > > > In the future, I suggest adding the major
> >     > > > > > > > > > > > downstream test cases, e.g. from sockeye,
> >     > > > > > > > > > > > GluonNLP, GluonCV, DGL, and Gluon-TS, into the
> >     > > > > > > > > > > > nightly test. If that's still too heavy, maybe
> >     > > > > > > > > > > > test weekly or monthly :)
> >     > > > > > > > > > > >
> >     > > > > > > > > > > > Thanks,
> >     > > > > > > > > > > >
> >     > > > > > > > > > > > --Patric
> >     > > > > > > > > > > >
> >     > > > > > > > > > > > > -----Original Message-----
> >     > > > > > > > > > > > > From: Anirudh Subramanian
> >     > > > > > > > > > > > > [mailto:anirudh2290@gmail.com]
> >     > > > > > > > > > > > > Sent: Friday, June 21, 2019 9:31 AM
> >     > > > > > > > > > > > > To: dev@mxnet.incubator.apache.org
> >     > > > > > > > > > > > > Cc: dev@mxnet.apache.org
> >     > > > > > > > > > > > > Subject: Re: [VOTE] Release Apache MXNet
> > (incubating)
> >     > > > > > > > > > > > > version
> >     > > > > > > > > > > > > 1.5.0.rc1
> >     > > > > > > > > > > > >
> >     > > > > > > > > > > > > Hi Lai,
> >     > > > > > > > > > > > >
> >     > > > > > > > > > > > > I have opened an issue:
> >     > > > > > > > > > > > >
> >     > https://github.com/apache/incubator-mxnet/issues/15297
> >     > > > > > > > > > > > > I came to know about this issue only today
> and
> > I have
> >     > > > > > > > > > > > > not been
> >     > > > > > > > > > > monitoring
> >     > > > > > > > > > > > > sockeye.
> >     > > > > > > > > > > > > I jumped onto this issue to make sure it
> wasn't
> >     > caused
> >     > > > > > > > > > > > > by the dlpack
> >     > > > > > > > > > > > changes.
> >     > > > > > > > > > > > > Also, I don't think sockeye's CI checks
> >     > > > > > > > > > > > > against master; it is using 1.4.1.
> >     > > > > > > > > > > > >
> >     > > > > > > > > > > > > Anirudh
> >     > > > > > > > > > > > >
> >     > > > > > > > > > > > >
> >     > > > > > > > > > > > > On Thu, Jun 20, 2019 at 6:17 PM Lai Wei
> >     > > > > > > > > > > > > <royweilai@gmail.com>
> >     > > > > wrote:
> >     > > > > > > > > > > > >
> >     > > > > > > > > > > > > > Hi,
> >     > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > Could you share which test failed and what
> >     > > > > > > > > > > > > > the crash was? How can it be reproduced?
> >     > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > I was able to install sockeye, and all
> >     > > > > > > > > > > > > > tests passed using python setup.py test.
> >     > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > I have tested both the nightly pip package
> >     > > > > > > > > > > > > > and 1.5.0.rc1.
> >     > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > It would be great to create an issue with
> >     > > > > > > > > > > > > > reproducible steps and move the discussion
> > there.
> >     > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > Also, I see the sockeye nightly build[1]
> >     > > > > > > > > > > > > > has been failing for some time. If it's due
> >     > > > > > > > > > > > > > to an MXNet change, please raise it early so
> >     > > > > > > > > > > > > > we can track and solve it in time, rather
> >     > > > > > > > > > > > > > than blocking the release during the vote.
> >     > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > [1] https://travis-ci.org/awslabs/sockeye
> >     > > > > > > > > > > > > >
> >     > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > On Fri, Jun 21, 2019 at 7:01 AM Anirudh
> > Subramanian
> >     > > > > > > > > > > > > > <anirudh2290@gmail.com
> >     > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > wrote:
> >     > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > I was able to reproduce a crash with the
> > commit
> >     > > > > > > > > > > > > > > 09202f7f261954383aa387144524d38f83f18d06
> > but not
> >     > > > > > > > > > > > > > > with the commit
> >     > > > > a862270beb2d796c1ba311183f7f4a766a18ad6c.
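[Editor's note] Deciding between two suspect commits like the hashes above is exactly what `git bisect run` automates. Below is a self-contained toy demo of the mechanics only: the throwaway repo, commit messages, and `check.sh` script are invented for illustration. A real bisect would use the two commit hashes from this thread as the good/bad endpoints and sockeye's test suite as the check command.

```shell
# Toy demonstration of `git bisect run`: build a throwaway repo in which a
# known commit flips a check from passing to failing, then let bisect find
# that commit automatically.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect-demo

printf 'exit 0\n' > check.sh             # check passes
git add check.sh
git commit -qm "good: baseline"
git commit -q --allow-empty -m "good: unrelated change"
printf 'exit 1\n' > check.sh             # the "regression" lands here
git commit -qam "bad: introduces crash"
git commit -q --allow-empty -m "bad: later change"

git bisect start HEAD HEAD~3             # bad = HEAD, good = HEAD~3
git bisect run sh check.sh               # exit 0 = good, nonzero = bad
```

The run ends by reporting the "introduces crash" commit as the first bad commit; `git bisect reset` returns the checkout to normal afterwards.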
> >     > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > Anirudh
> >     > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > On Thu, Jun 20, 2019 at 3:53 PM Lai Wei
> >     > > > > > > > > > > > > > > <royweilai@gmail.com>
> >     > > > > > > > > > > wrote:
> >     > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > Hi Przemyslaw,
> >     > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > Is there an issue with more details to
> > track
> >     > the
> >     > > > problem?
> >     > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > On Fri, Jun 21, 2019 at 6:04 AM
> Przemysław
> >     > > > > > > > > > > > > > > > Trędak <ptrendx@apache.org>
> >     > > > > > > > > > > > > > > > wrote:
> >     > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > > -1
> >     > > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > > There is a crash in the sockeye unit
> >     > > > > > > > > > > > > > > > > tests (python setup.py test) observed
> >     > > > > > > > > > > > > > > > > starting with the nightly 1.5 build
> >     > > > > > > > > > > > > > > > > from 6/13 and still occurring in
> >     > > > > > > > > > > > > > > > > 1.5.0.rc1. I don't yet have the exact
> >     > > > > > > > > > > > > > > > > commit responsible, but it is either
> >     > > > > > > > > > > > > > > > > a862270beb2d796c1ba311183f7f4a766a18ad6c
> >     > > > > > > > > > > > > > > > > (dlpack related) or
> >     > > > > > > > > > > > > > > > > 09202f7f261954383aa387144524d38f83f18d06
> >     > > > > > > > > > > > > > > > > (cached op optimization).
> >     > > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > > On 2019/06/20 06:36:22, Lai Wei
> >     > > > > > > > > > > > > > > > > <royweilai@gmail.com>
> >     > > > > wrote:
> >     > > > > > > > > > > > > > > > > > Dear MXNet community,
> >     > > > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > > > This is the 3-day vote to release
> > Apache
> >     > > > > > > > > > > > > > > > > > MXNet
> >     > > > > > > > > > > > > > > > > > (incubating) version
> >     > > > > > > > > > > > > > > > > 1.5.0.
> >     > > > > > > > > > > > > > > > > > Voting on dev@ will start June 19,
> >     > > > > > > > > > > > > > > > > > 23:59:59 (PST) and close on June 22,
> >     > > > > > > > > > > > > > > > > > 23:59:59.
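[Editor's note] The stated window can be sanity-checked against the "3-day vote" wording with a quick date calculation:

```python
from datetime import datetime

start = datetime(2019, 6, 19, 23, 59, 59)  # vote opens (PST)
close = datetime(2019, 6, 22, 23, 59, 59)  # vote closes (PST)

window = close - start
print(window)  # 3 days, 0:00:00
```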
> >     > > > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > > > 1) Link to release notes:
> >     > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> >     > > > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > > > 2) Link to release candidate:
> >     > > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc1
> >     > > > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > > > 3) Link to source and signatures on apache dist server:
> >     > > > > > > > > > > > > > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc1/
> >     > > > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > > > Please remember to TEST first before voting accordingly:
> >     > > > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > > > +1 = approve
> >     > > > > > > > > > > > > > > > > > +0 = no opinion
> >     > > > > > > > > > > > > > > > > > -1 = disapprove (provide reason)
> >     > > > > > > > > > > > > > > > > > --
> >     > > > > > > > > > > > > > > > > > Best Regards
> >     > > > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > > > Lai
> >     > > > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > --
> >     > > > > > > > > > > > > > > > Best Regards
> >     > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > > > Lai
> >     > > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > --
> >     > > > > > > > > > > > > > Best Regards
> >     > > > > > > > > > > > > >
> >     > > > > > > > > > > > > > Lai
> >     > > > > > > > > > > > > >
> >     > > > > > > > > > > >
> >     > > > > > > > > > >
> >     > > > > > > > > > --
> >     > > > > > > > > > Best Regards
> >     > > > > > > > > >
> >     > > > > > > > > > Lai
> >     > > >
> >     > > --
> >     > > Best Regards
> >     > >
> >     > > Lai
> >     >
> >     >
> >
> >     --
> >     Sandeep Krishnamurthy
>
> --
Best Regards

Lai
