From: Foivos Diakogiannis
Date: Tue, 25 Sep 2018 11:42:58 +0800
Subject: Re: Some feedback from MXNet Zhihu topic
To: dev@mxnet.incubator.apache.org

Dear all,

First, my compliments on this great software, and thank you all for the
effort you've put into it. I am a Gluon API user, and I thought I should
give some feedback to highlight some user-perspective issues.

I work at CSIRO, where I use Gluon to write and deploy custom deep
learning models for semantic segmentation/classification on CSIRO HPC
facilities. I came into the deep learning world in July 2017 (2nd
postdoc, after astronomy), starting with Keras (a great intro, but too
simple/automated for my needs), moving on to TF (the complexity of C++
plus the inconvenience of Python, and performance and memory management
were bad; on the plus side, great documentation and community support,
and of course a great product overall, just not for me), and since
December 2017 I have been using Gluon exclusively, as it solved the
majority of my problems.

Things I love about Gluon:

1. Great structured tutorials (https://gluon.mxnet.io/), like a book. In
fact, when I started using Gluon, this was better (i.e. more structured,
with a beginning and an end) than the PyTorch documentation.

2. Efficient code, both in speed and in GPU memory consumption.

3. With the push of a button (hybridize) I can go from research to
production; see the sketch after this list. I get up to a 3-4x speed-up,
which is a huge benefit, and I don't see other frameworks easily beating
that in the immediate future. torch.jit is nowhere near the ease of use
of hybridize() - not yet.

4. I really value the effort/support given on the discuss.mxnet.io
forum. Almost always when I have a problem I find a solution there, from
experts. This complements my knowledge/understanding of the code around
the Gluon API.

5. Super easy data-parallel modeling. The examples in the tutorial make
life really easy. This made a huge difference for me, and it was the
main reason I switched from TF to Gluon.
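To make point 3 concrete, here is a minimal sketch of the workflow (my
own toy example, not taken from the tutorials; the network, sizes, and
file name are made up for illustration):

    import mxnet as mx
    from mxnet import nd
    from mxnet.gluon import nn

    # A toy network built from HybridBlocks.
    net = nn.HybridSequential()
    net.add(nn.Dense(128, activation='relu'),
            nn.Dense(10))
    net.initialize()

    x = nd.random.uniform(shape=(32, 64))
    y_imperative = net(x)   # eager execution, easy to debug

    net.hybridize()         # compile into a cached symbolic graph
    y_hybrid = net(x)       # same call, now runs the optimized graph

    # After one forward pass, export() writes toy_model-symbol.json and
    # toy_model-0000.params, deployable without the Python code.
    net.export('toy_model')

The same imperative code runs through the cached graph after
hybridize(), which is where the speed-up comes from.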
Things I find difficult with Gluon:

1. Documentation is not all in one place: gluon-cv and gluon-nlp are
things I learned exist (and they have great examples) via Twitter. These
should be on the main mxnet page, somewhere all together (they should
actually be advertised). In addition, sometimes the examples are not
updated with the latest changes. For example,
mynet.collect_params().initialize(...) in the Gluon "book" should now be
mynet.initialize(...), and there are several other examples in the same
spirit. Also, I don't see a clear definition/description of new methods
in the release announcements when they are added, so that I know how to
improve my code. For example, I learned about the block.summary(*inputs)
feature by checking the pull requests. Yes, it exists in the official
API documentation, and I am used to going through all of it every now
and then. This can be done better.

2. Not all custom architectures are easy to implement in a hybrid
format. For example, taking the shape of a layer and using that as
information for pooling layers (or other things) is not easy (without
copying to cpu first), and many times I have to implement hacks to get
around this (for the performance gains); a minimal sketch of the
underlying issue follows this list. See, for example:
https://discuss.mxnet.io/t/coordconv-layer/1394/4
Another example is the pyramid scene parsing network; it took me a lot
of time and many hacks to hybridize it.

3. The distributed examples are not yet fully functional. Running
distributed training to increase the batch size is OK-ish (under the
SLURM manager, see this:
https://discuss.mxnet.io/t/distributed-training-questions/1269/6 ), but
implementing async SGD is - at least for me - still an open problem. Of
course, I completely understand that distributed training is still very
much a research topic, and I am not sure whether using a large batch
size is good for training (hence my effort to use async SGD); I've read
various opinions on this in research papers. At the moment I use
distributed mode only for hyperparameter optimization, and I increase
the batch size (when necessary) with delayed gradient updates.

4. No higher-order gradient support. This is where PyTorch is better,
and where I am forced to use it in my GAN experiments for the gradient
penalty implementation
(https://github.com/apache/incubator-mxnet/issues/10002). I hope this
will change in the immediate future. It is my understanding that a lot
of effort goes into semi-supervised training techniques, and my gut
feeling tells me that GANs are an important key ingredient in the
solution of that problem.
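To illustrate the shape issue from point 2, here is a minimal contrived
sketch (my own, not from any MXNet example): after hybridize(),
hybrid_forward receives Symbols instead of NDArrays, and a Symbol
carries no concrete shape, so shape-dependent logic that works
imperatively breaks.

    import mxnet as mx
    from mxnet.gluon import nn

    class PoolToOne(nn.HybridBlock):
        def hybrid_forward(self, F, x):
            # Imperatively x is an NDArray and x.shape works; once the
            # block is hybridized, x is a Symbol, which has no concrete
            # shape, and this line fails.
            h, w = x.shape[2], x.shape[3]
            return F.Pooling(x, kernel=(h, w), pool_type='avg')

    class PoolToOneHybrid(nn.HybridBlock):
        def hybrid_forward(self, F, x):
            # Shape-free workaround for this particular case: let the
            # operator determine the spatial size via global pooling.
            return F.Pooling(x, global_pool=True, pool_type='avg',
                             kernel=(1, 1))

This particular case happens to have a shape-free operator; many
shape-dependent computations do not, and that is where the hacks come
in.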
Things I really don't like about mxnet:

1. The documentation for C++ is not clear. I have been developing code
in C++ for the past 8 years; I am not a software engineer by training,
but I feel comfortable-ish looking at, say, the source code of the Boost
library or of Eigen. I cannot say the same for mxnet. This is a barrier
that keeps me from even thinking of contributing C++ code.

Again, many thanks for all your efforts and for this awesome library!

Regards,
Foivos Diakogiannis

On Fri, Sep 21, 2018 at 12:51 AM Timur Shenkao wrote:

> There are:
> Gluon API
> Module API
> Some other APIs in mxnet
> low-level C / C++ APIs
>
> Recently I accidentally found that there exist such things as Gluon
> NLP and Gluon CV (besides some examples in MXNet itself).
> It's unclear whether I can rely on some API or whether I have to
> create my own C / C++ code.
>
> I implement publicly available articles and some other ideas in TF all
> the time. But when it comes to MXNet, I am often reluctant because
> it's difficult to understand which way to go. It's unclear whether my
> efforts will result in a working model or whether I will get stuck.
> Points #5 and #6 are absolutely true.
> As for documentation, all projects in the turbulent phase of their
> lifecycle have outdated docs; it's normal. I'd say the docs are very
> good (I remember early Spark & DL4J docs 😂)
>
>
>
> On Thursday, September 20, 2018, Tianqi Chen wrote:
>
> > The key complaint here is mainly about the clarity of the documents
> > themselves. Maybe it is time to focus on a single flavor of API that
> > is useful (Gluon) and highlight all the docs around that.
> >
> > Tianqi
> >
> >
> > On Wed, Sep 19, 2018 at 11:04 AM Qing Lan wrote:
> >
> > > Hi all,
> > >
> > > There was a trending topic on Zhihu (a famous Chinese Stack
> > > Overflow + Quora) recently, asking about the status of MXNet in
> > > 2018. Mu replied to the thread and received more than 300 `like`s.
> > > However, there are a few concerns raised in the comments of that
> > > thread. I have done some simple translation from Chinese to
> > > English:
> > >
> > > 1. Documentation! To this day, the online docs still contain:
> > >     1. Deprecated but not updated docs.
> > >     2. Wrong documentation with poor descriptions.
> > >     3. Docs for features still in alpha, such as "you must install
> > >        `pip --pre` in order to run".
> > >
> > > 2. Examples! For Gluon specifically, many examples still mix the
> > > Gluon and MXNet APIs. The mixture of mx.sym, mx.nd and mx.gluon
> > > confuses users about which one to choose to get their model to
> > > work. As an example: although Gluon makes data encapsulation
> > > possible, there are still examples using mx.io.ImageRecordIter
> > > with tens of params (it feels like the Gluon examples were simply
> > > copied from the old Python examples).
> > >
> > > 3. Examples again! Compared to PyTorch, there are a few things I
> > > don't like in the Gluon examples:
> > >     1. Some run, but the code structure is still very complicated,
> > >        such as example/image-classification/cifar10.py. It reads
> > >        like one long concatenation of code; in fact it is just a
> > >        series of layers mixed with model.fit. It makes it very
> > >        hard for users to modify/extend the model.
> > >     2. Some run only with certain settings; if users change the
> > >        model a little, it crashes. For example, in the multi-GPU
> > >        example on the Gluon website, MXNet hides the logic that
> > >        uses the batch size to rescale the learning rate inside the
> > >        optimizer. A lot of newbies didn't know this and would only
> > >        find that the model stopped converging when the batch size
> > >        changed.
> > >     3. The worst scenario is when the model itself simply doesn't
> > >        work. Maintainers in the MXNet community merged the code
> > >        directly without running the model (not even an integration
> > >        test), so the script stays broken until somebody raises an
> > >        issue and fixes it.
> > >
> > > 4. The community problem. The core advantages of MXNet are its
> > > scalability and efficiency, but the documentation of some tools is
> > > confusing. Two examples:
> > >     1. im2rec comes in two versions, C++ (binary) and Python, but
> > >        nobody would have guessed that the argparse options of
> > >        these tools are different (meanwhile there are no
> > >        appropriate examples to compare against, so users can only
> > >        work out the usage by guessing).
> > >     2. How do you combine MXNet distributed training with a
> > >        supercomputing tool such as Slurm? How do we profile, and
> > >        how do we debug? A couple of companies I knew considered
> > >        using MXNet for distributed training; due to the lack of
> > >        examples and poor support from the community, they had to
> > >        move their models to TensorFlow and Horovod.
> > >
> > > 5. The heavy code base. Most of the MXNet examples, source code,
> > > documentation and language bindings live in a single repo, so a
> > > git clone costs tens of MB. New-feature PRs take longer than
> > > expected, and the poor review responsiveness/rules keep new
> > > contributors away from the community. I remember a call for
> > > documentation improvement last year where the total timeline cost
> > > a user 3 months to get merged into master. That is almost a full
> > > PyTorch release interval.
> > >
> > > 6. To the developers. Very few people in the community discuss
> > > what we could do to make MXNet more user-friendly. It is so easy
> > > to trigger tens of stack issues while coding. Again, is it a
> > > requirement that MXNet users be familiar with C++?
> > > The connection between Python and C lacks IDE lint support (maybe
> > > MXNet assumes every developer is a VIM master). The API and
> > > underlying implementation change frequently, so people have to
> > > release their code pinned to an archived version of MXNet (as
> > > TuSimple and MSRA do). Look at PyTorch: even an API for moving a
> > > tensor to a device gets a thorough discussion.
> > >
> > > There will be more comments translated to English, and I will keep
> > > this thread updated…
> > > Thanks,
> > > Qing