Happy Thanksgiving, everyone :)
Hi Marco,
Thanks to raising this question. We will look into the details for CI test cases and Shufan
will provide the 1:1 OP level performance data.
In general, the CI tests is not performant cases which covered the different situations, even
corner cases, for the quality purpose.
Specifically, some overhead may be introduced if we connect an MKLDNN OP and non MKLDNN
OP where the extra data conversion will happen.
So, it's the expected behaviors that the performance drops in these kinds of test cases in
which the computation is quite tiny.
But in the real workload, these situations are not such frequency and the total performance
of MKLDNN will be much better than OpenBLAS.
(data in here: https://cwiki.apache.org/confluence/display/MXNET/MXNet+with+Intel+MKLDNN++Performance+Benchmarking)
For a short time, Shufan will help to look into these test cases and figure out a proper solution
to make the CI faster.
For a medium and long time, we are working on implementing more MKLDNN supported Ops, like
reshape, slice, split, so that less data format conversion will be involved.
Feel free to let me know if you have any other concerns.
BR,
Patric
> Original Message
> From: Marco de Abreu [mailto:marco.g.abreu@googlemail.com.INVALID]
> Sent: Friday, November 23, 2018 9:22 AM
> To: dev@mxnet.incubator.apache.org
> Subject: MKLDNN performance in CI
>
> Hello,
>
> I have noticed that our Python tests have been increasing in duration recently.
> In order to analyse this further, I created the PR [1] which allows to record
> test durations. Please note that I did not dive deep on these numbers and
> that they have to be taken with a grain of salt since slaves have varying
> resource utilizations.
>
> Please have a look at the two following logs:
> Python3 CPU MKLDNN:
> http://jenkins.mxnetci.amazon
> ml.com/blue/rest/organizations/jenkins/pipelines/mxnet
> validation/pipelines/unixcpu/branches/PR
> 13377/runs/2/nodes/155/steps/409/log/?start=0
> Python3 CPU Openblas:
> http://jenkins.mxnetci.amazon
> ml.com/blue/rest/organizations/jenkins/pipelines/mxnet
> validation/pipelines/unixcpu/branches/PR
> 13377/runs/2/nodes/152/steps/398/log/?start=0
>
> If you scroll to the end (note that there are multiple test stages and
> summaries being printed in these logs), you will find the following
> statements:
>
> Python3 CPU MKLDNN: "Ran 702 tests in 3042.102s"
> Python3 CPU Openblas: "Ran 702 tests in 2158.458s"
>
> This shows that the MKLDNN is generally being about 40% slower than the
> Openblas backend. If we go into the details, we can see that some tests are
> significantly slower:
>
> Python3 CPU MKLDNN:
>
> >[success] 20.78% test_random.test_shuffle: 630.7165s [success] 17.79%
> >test_sparse_operator.test_elemwise_binary_ops: 540.0487s [success]
> >10.91% test_gluon_model_zoo.test_models: 331.1503s [success] 2.62%
> >test_operator.test_broadcast_binary_op: 79.4556s [success] 2.45%
> >test_operator.test_pick: 74.4041s [success] 2.39%
> >test_metric_perf.test_metric_performance: 72.5445s [success] 2.38%
> >test_random.test_negative_binomial_generator: 72.1751s [success] 1.84%
> >test_operator.test_psroipooling: 55.9432s [success] 1.78%
> >test_random.test_poisson_generator: 54.0104s [success] 1.72%
> >test_gluon.test_slice_pooling2d_slice_pooling2d: 52.3447s [success]
> >1.60% test_contrib_control_flow.test_cond: 48.6977s [success] 1.41%
> >test_random.test_random: 42.8712s [success] 1.03%
> >test_operator.test_layer_norm: 31.1242s
>
>
> Python3 CPU Openblas:
> > [success] 26.20% test_gluon_model_zoo.test_models: 563.3366s [success]
> > 4.34% test_random.test_shuffle: 93.3157s [success] 4.31%
> > test_random.test_negative_binomial_generator: 92.6899s [success] 3.78%
> > test_sparse_operator.test_elemwise_binary_ops: 81.2048s [success]
> > 3.30% test_operator.test_psroipooling: 70.9090s [success] 3.20%
> > test_random.test_poisson_generator: 68.7500s [success] 3.10%
> > test_metric_perf.test_metric_performance: 66.6085s [success] 2.79%
> > test_operator.test_layer_norm: 59.9566s [success] 2.66%
> > test_gluon.test_slice_pooling2d_slice_pooling2d: 57.1887s [success]
> > 2.62% test_operator.test_pick: 56.2312s [success] 2.60%
> > test_random.test_random: 55.8920s [success] 2.19%
> > test_operator.test_broadcast_binary_op: 47.1879s [success] 0.96%
> > test_contrib_control_flow.test_cond: 20.6908s
>
> Tests worth noting:
>  test_random.test_shuffle: 700% increase  but I don't know how this may
> be related to MKLDNN. Are we doing random number generation in either of
> those backends?
>  test_sparse_operator.test_elemwise_binary_ops: 700% increase
>  test_gluon_model_zoo.test_models: 40% decrease  that's awesome and to
> be expect :)
>  test_operator.test_broadcast_binary_op: 80% increase
>  test_contrib_control_flow.test_cond: 250% increase
>  test_operator.test_layer_norm: 50% decrease  nice!
>
> As I have stated previously, these numbers might not mean anything since
> the CI is not a benchmarking environment (sorry if these are false negatives),
> but I thought it might be worth mentioning so Intel could follow up and dive
> deeper.
>
> Does anybody here create 1:1 operator comparisons (e.g. running
> layer_norm in the different backends to compare the performance) who
> could provide us with those numbers?
>
> Best regards,
> Marco
>
> [1]: https://github.com/apache/incubatormxnet/pull/13377
