mxnet-dev mailing list archives

From Qing Lan <lanking...@live.com>
Subject Some feedback from MXNet Zhihu topic
Date Wed, 19 Sep 2018 18:04:37 GMT
Hi all,

A trending topic <https://www.zhihu.com/question/293996867> on Zhihu (a well-known Chinese
site, roughly Stack Overflow plus Quora) recently asked about the status of MXNet in 2018. Mu
replied to the thread and received more than 300 likes.
However, the comments on the thread raise a few concerns. I have roughly translated them
from Chinese to English:

1. Documentation! Even now, the online docs still contain:
                1. Deprecated content that has not been updated
                2. Incorrect documentation with poor descriptions
                3. Documentation for features still in alpha, for example ones that require
a `pip install --pre` build in order to run.

2. Examples! For Gluon specifically, many examples still mix Gluon and older MXNet APIs. The
mixture of mx.sym, mx.nd and mx.gluon leaves users unsure which one to choose in order to get
their model to work. For example, although Gluon makes data encapsulation possible, there are
still examples that use mx.io.ImageRecordIter with tens of parameters (it feels like the Gluon
examples were simply copied from the old Python examples).

3. Examples again! Compared to PyTorch, there are a few kinds of examples I don't like in Gluon:
                1. Examples that run, but whose code structure is still very complicated,
such as example/image-classification/cifar10.py. It reads like blocks of code concatenated one
after another. In fact, it is just a series of layers mixed with model.fit, which makes it very
hard for users to modify or extend the model.
                2. Examples that only run with certain settings. If users change the model
even slightly, it crashes. For example, in the multi-GPU example on the Gluon website, MXNet
hides the logic that uses the batch size to adjust the learning rate in the optimizer. Many
newcomers don't know this, and all they see is that the model stops converging once the batch
size changes.
                3. The worst case is that the model simply does not work. Maintainers in the
MXNet community merged the code directly without running the model (there is not even an
integration test). The script stays broken until somebody raises an issue and fixes it.
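
To illustrate the batch-size point in 3.2, here is a minimal sketch in plain Python. It does not call MXNet, and every name in it is hypothetical. It assumes a one-parameter model y = w*x with a squared-error loss, and compares an SGD step that uses the summed gradient with one that divides the sum by the batch size. This is only an illustration of the general effect, not the code of the multi-GPU example itself.

```python
# Plain-Python sketch (no MXNet): how normalizing the gradient by the batch
# size changes the effective step of SGD. All names here are hypothetical.

def summed_gradient(w, batch):
    """Sum over the batch of d/dw [0.5 * (w*x - y)**2] = (w*x - y) * x."""
    return sum((w * x - y) * x for x, y in batch)

def sgd_step(w, batch, lr, normalize):
    """One SGD update; optionally divide the summed gradient by the batch size."""
    grad = summed_gradient(w, batch)
    if normalize:
        grad /= len(batch)
    return w - lr * grad

# Use identical samples so the only thing that varies is the batch size.
sample = (1.0, 2.0)
small_batch = [sample] * 4
large_batch = [sample] * 32
w0, lr = 0.0, 0.1

# Size of the update applied to w in each configuration.
raw_small = sgd_step(w0, small_batch, lr, normalize=False) - w0
raw_large = sgd_step(w0, large_batch, lr, normalize=False) - w0
norm_small = sgd_step(w0, small_batch, lr, normalize=True) - w0
norm_large = sgd_step(w0, large_batch, lr, normalize=True) - w0

print("summed gradient:    ", raw_small, raw_large)
print("normalized gradient:", norm_small, norm_large)
```

With the summed gradient, the update grows in proportion to the batch size, so the same learning rate behaves like a much larger one on a bigger batch. With normalization the update stays the same. Gluon's `Trainer.step(batch_size)` is documented as normalizing the gradient by 1/batch_size, so a script that changes the batch size in one place but not the other can silently change its effective learning rate.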

4. The community problem. The core advantages of MXNet are its scalability and efficiency.
However, the documentation of some tools is confusing. Here are two examples:

                1. im2rec comes in two versions, C++ (binary) and Python. Nobody would expect
the command-line arguments of the two tools to differ (and there are no suitable examples
to compare against, so users can only guess at the usage).

                2. How do you combine MXNet's distributed platform with a supercomputing tool
such as Slurm? How do you profile and debug? A couple of companies I know considered using
MXNet for distributed training. Due to the lack of examples and poor support from the community,
they had to port their models to TensorFlow and Horovod.

5. The heavy code base. Most of the MXNet examples, source code, documentation and language
bindings live in a single repo. A git clone costs tens of MB. New-feature PRs take longer
than expected. Poor review response and rules keep new contributors away from the community.
I remember there was a call for documentation improvements last year. It took one user 3 months
in total to get the change merged into master. That is almost equal to a PyTorch release
interval.

6. To developers. Very few people in the community discuss what improvements we could make
to make MXNet more user-friendly. It is far too easy to trigger tens of stack issues while
coding. Again, is it a requirement for MXNet users to be familiar with C++? The connection
between Python and C lacks IDE lint support (maybe MXNet assumes every developer is a Vim
master). The API and the underlying implementation change frequently, so people have to
release their code with an archived version of MXNet (as TuSimple and MSRA do). Compare
PyTorch, where even an API for moving a tensor to a device would prompt a thorough discussion.

I will translate more comments into English and keep this thread updated…
Thanks,
Qing