mxnet-dev mailing list archives

From Marco de Abreu <>
Subject Re: A proposal for unified integration with external acceleration libraries
Date Tue, 05 Jun 2018 08:41:16 GMT
I definitely like this idea. We have been discussing the exact
same idea within our team and think that it allows MXNet to scale better
with the increasing number of backends. Imagine we get AMD-CPU, AMD-GPU,
ARM, Snapdragon and other vendors/chips. We are definitely running into
maintainability problems, and I'd welcome an implementation-agnostic
solution in which backends can be implemented completely independently of
each other.

We've been thinking about a structure like the following: We have a
pre-processing stage (similar to CuDNN autotuning) which determines which
operator implementation is the best choice for that specific model on
the available hardware. During that stage, we also interact with optimizers
(which are modular as well) that can provide additional graph fusion.
In the end, we will have a graph of nodes which may be from different
backends and thus have an associated type. This graph may consist of the
same number of nodes as the input graph, or may contain fewer if
graph fusion has been applied. In any case, our executor will run this
graph as usual, but will determine if a memory layout conversion (or
mem-copy to CPU/GPU/third-party-chip) is required. If two operators are of
the same backend-type, we don't execute any conversion and leave it as it
is. Think about the following structure (sorry, email list does not allow
> (Operator1-MKL -> Operator2-MKL) -> (Operator3-CUDA -> Operator4-CUDA) ->
(Operator5-MKL) -> (Operator6-MXNet)
In this case, this graph will be unrolled to the following:
> Mem-Convert-MXNet-MKL -> (Operator1-MKL -> Operator2-MKL) ->
Mem-Convert-Copy-MKL-CUDA -> (Operator3-CUDA -> Operator4-CUDA) ->
Mem-Convert-Copy-CUDA-MKL -> (Operator5-MKL) -> Mem-Convert-MKL-MXNet ->
(Operator6-MXNet)
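
The unrolling step can be sketched in a few lines of Python. This is only an
illustration of the idea, not MXNet code: the function names, the DEVICE
table and the representation of a graph as a flat list of (operator, backend)
pairs are all assumptions made for this example.

```python
# Hypothetical mapping from backend to the device its memory lives on.
DEVICE = {"MXNet": "cpu", "MKL": "cpu", "CUDA": "gpu"}


def converter_name(src, dst):
    """Name the conversion node; a device change also implies a copy."""
    kind = "Mem-Convert-Copy" if DEVICE[src] != DEVICE[dst] else "Mem-Convert"
    return f"{kind}-{src}-{dst}"


def unroll(chain, default_backend="MXNet"):
    """Insert a conversion node wherever the backend type changes.

    `chain` is a linear list of (operator_name, backend) pairs. Inputs are
    assumed to arrive in the default MXNet layout.
    """
    unrolled = []
    current = default_backend
    for name, backend in chain:
        if backend != current:
            unrolled.append(converter_name(current, backend))
            current = backend
        # Same backend as the previous node: no conversion is inserted.
        unrolled.append(f"{name}-{backend}")
    return unrolled


chain = [
    ("Operator1", "MKL"), ("Operator2", "MKL"),
    ("Operator3", "CUDA"), ("Operator4", "CUDA"),
    ("Operator5", "MKL"), ("Operator6", "MXNet"),
]
print(" -> ".join(unroll(chain)))
```

Running it on the six-operator chain from the example reproduces the sequence
of conversion nodes listed above.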
This might sound inefficient initially, but considering that we have some
*very* good implementations in some libraries/chips but none of them excels
in every single category, this is a way to get the best of all worlds.
The key here is the auto-tuning feature, which also takes the overhead
of mem-copy and mem-conversion into account.
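
To make the trade-off concrete, here is a minimal sketch of such a selection
step for a linear chain of operators. It is not the tuner we have in mind in
detail: the cost numbers are made up, choose_backends and convert_cost are
hypothetical names, and a real graph would need more than a chain. It picks
the backend per operator so that run time plus conversion overhead is
minimal (a small dynamic program).

```python
def convert_cost(src, dst):
    """Made-up conversion costs: free within a backend, pricier via GPU."""
    if src == dst:
        return 0.0
    return 3.0 if "CUDA" in (src, dst) else 1.0


def choose_backends(op_costs, start="MXNet"):
    """Return (total_cost, backend_per_operator) for a linear chain.

    `op_costs` holds one dict per operator, mapping each backend that
    implements the operator to its (measured) run time.
    """
    # best[b] = cheapest (cost, assignment) so far that ends in backend b
    best = {start: (0.0, [])}
    for costs in op_costs:
        nxt = {}
        for backend, run_time in costs.items():
            total, path = min(
                ((prev_total + convert_cost(prev, backend) + run_time, prev_path)
                 for prev, (prev_total, prev_path) in best.items()),
                key=lambda entry: entry[0],
            )
            nxt[backend] = (total, path + [backend])
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])


op_costs = [
    {"MXNet": 2.0, "MKL": 3.0},
    {"MXNet": 3.0, "MKL": 2.5},   # MKL kernel is faster in isolation
    {"MXNet": 5.0, "CUDA": 1.0},
]
total, plan = choose_backends(op_costs)
print(total, plan)
```

With these numbers the second operator stays on MXNet even though its MKL
kernel is faster in isolation, because the conversion would cost more than
it saves, while the third operator moves to CUDA because the gain outweighs
the copy.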

This approach would allow us to define the header of each operator (input,
output, additional parameters) and each framework would define them
individually. Every backend would have to bring at least the following
converters: Mem-Convert-MXNet-BACKEND & Mem-Convert-BACKEND-MXNet. This would
provide basic functionality. If no direct conversion between two backends
is available, they will fall back to these required converters, using the
MXNet memory as the intermediate representation. Obviously, we'd like to
encourage developers to also provide direct converters. While this would be
on the order of N² converters, I still think that this is easily manageable.
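
A rough sketch of how such a converter registry with fallback could look,
again in Python and purely illustrative: ConverterRegistry, register_backend
and the tagging helper are hypothetical names, and "data" is just a tuple
that records which conversions were applied.

```python
class ConverterRegistry:
    """Looks up a direct converter first, then falls back via MXNet."""

    def __init__(self, hub="MXNet"):
        self.hub = hub
        self._converters = {}  # (src, dst) -> callable

    def register(self, src, dst, fn):
        self._converters[(src, dst)] = fn

    def register_backend(self, name, to_hub, from_hub):
        # The two converters every backend is required to bring.
        self.register(name, self.hub, to_hub)
        self.register(self.hub, name, from_hub)

    def convert(self, data, src, dst):
        if src == dst:
            return data  # same backend type: leave the memory as it is
        direct = self._converters.get((src, dst))
        if direct is not None:
            return direct(data)
        # No direct converter: go through the MXNet layout.
        to_hub = self._converters[(src, self.hub)]
        from_hub = self._converters[(self.hub, dst)]
        return from_hub(to_hub(data))


def tagging_converter(src, dst):
    """Stand-in converter that only relabels the data and logs the step."""
    def fn(data):
        layout, payload, trace = data
        assert layout == src
        return (dst, payload, trace + [f"{src}->{dst}"])
    return fn


registry = ConverterRegistry()
for backend in ("MKL", "CUDA"):
    registry.register_backend(
        backend,
        tagging_converter(backend, "MXNet"),
        tagging_converter("MXNet", backend),
    )

via_hub = registry.convert(("MKL", [1, 2, 3], []), "MKL", "CUDA")
registry.register("MKL", "CUDA", tagging_converter("MKL", "CUDA"))
direct = registry.convert(("MKL", [1, 2, 3], []), "MKL", "CUDA")
print(via_hub[2], direct[2])
```

With N backends this needs 2N mandatory converters; direct converters are
optional and number at most N(N-1) if every pair is covered in both
directions.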

Sorry for hijacking this a bit, Da, but please feel free to comment
on that approach or include it in the proposal.


On Mon, Jun 4, 2018 at 8:52 PM, Zheng, Da <> wrote:

> Hi Tao,
> Thanks for your feedback.
> For your questions:
> 1. This subgraph strategy is just a mechanism for integration with
> external libraries. We can use it if it provides benefits. It seems to me
> that CuDNN doesn't benefit much from this strategy. Although NHWC might be
> non-default, this layout just interprets dimensions of an array
> differently, which is very different from MKLDNN formats. The meaning of
> dimensions makes sense for only a few operators, so any operator that
> doesn't need to interpret dimensions can run on the arrays without any
> modification. It doesn't seem to me that it's necessary to isolate CuDNN
> operators from any other MXNet operators.
> 2. Imperative Gluon doesn't have subgraph. We can potentially consider an
> operator as a subgraph, so the strategy still works for Imperative Gluon.
> However, the question is why we want to make it work for imperative Gluon.
> Imperative Gluon is mainly used for debugging and doesn't care about
> performance much, while the majority of the acceleration libraries I mentioned
> in the proposal are for accelerating inference and model serving. MKLDNN is
> probably the only exception. In the imperative gluon mode, we can have
> MKLDNN operators always output arrays with the default format.
> 3. You are absolutely right. The subgraph strategy can't avoid data
> conversion when conversion is needed. Currently, if the operators can
> understand both default and MKLDNN NDArrays, it works fine and we have
> spent a lot of time making this work well. However, the current MKLDNN
> backend can't handle well the interaction between the MKLDNN operators and
> the non-MKLDNN operators. This isn't just simply conversion between default
> NDArrays and MKLDNN NDArrays. To make this work, our choices are to
> * make all operators (the ones that use FComputeEx) understand MKLDNN
> NDArray. This isn't scalable. There will be a lot of modifications on the
> operators. In the future, we might have more backends and we need to do the
> same for other backends.
> * have the executor recognize MKLDNN operators and perform data
> conversion. This makes the executor complex, since it needs to understand all
> backends.
> * use the subgraph strategy to isolate MKLDNN operators. This is preferred
> for MKLDNN because the subgraph strategy is useful for many purposes (e.g.,
> integration with acceleration libraries, dynamic shape inference, etc). We
> don't need to do much to make the subgraph strategy work well with MKLDNN
> as well and keep the executor simple and easy to maintain.
> Another problem for the current implementation is that MKLDNN NDArrays are
> subject to the default memory planning of MXNet (this means an MKLDNN
> NDArray is reused in a computation graph). This problem caused a few bugs
> in the past and the fixes made the executor complex. The subgraph strategy
> can solve this problem in a cleaner way by using a different memory
> planning inside the MKLDNN subgraph (e.g., disable NDArray reuse inside the
> subgraph).
> Best,
> Da
> On 6/3/18, 10:28 PM, "Lv, Tao A" <> wrote:
>     Hi Da and other developers,
>     It's a great idea to limit external acceleration libs into certain
> scope and subgraph. I am not quite familiar with TVM and TensorRT's design.
> But from the side of MKL-DNN backend, here are my concerns on this proposal:
>     1. Is subgraph for all third party acceleration libraries or just for
> those that have different data layouts? I guess cudnn is also using a
> non-default data layout (say NHWC) for int8. So does the cudnn path also need
> to follow this proposal? I ask since cudnn is not mentioned in the proposal.
>     2. Would subgraph break the execution of imperative gluon interfaces?
> If we don't apply subgraph to imperative gluon, does that mean imperative
> gluon models cannot benefit from any acceleration libraries?
>     3. Currently, most issues of mkldnn backend are from the interchange
> between mxnet default ndarray and mkldnn memory. Even after subgraph is
> applied to mkldnn backend, there will still be some fallback processes
> for those inputs which are not supported by mkldnn or those inputs which
> are views of other tensors. So we still need to deal with the layout
> transformation between mkldnn specific layouts and mxnet default layout. We
> cannot avoid these with the current design of subgraph.
>     For pushing mkldnn backend from 'experimental' to 'GA' in 1.3 release,
> we are working intensively to add more unit tests and improve the stability
> of it. Hopefully, these fixes and tests will upstream or be merged soon.
> Meanwhile, we are also trying to figure out how to improve the subgraph
> solution for properly addressing current issues and better extendibility in
> the future.
>     Any comments and suggestions will be highly appreciated. Thanks.
>     -tao
>     -----Original Message-----
>     From: Zheng, Da []
>     Sent: Saturday, June 2, 2018 4:38 AM
>     To:
>     Subject: A proposal for unified integration with external acceleration
> libraries
>     Hello all,
>     We would like to propose a new mechanism that unifies the integration
> with most of the external acceleration libraries, including TVM, MKLDNN,
> TensorRT and more. The main idea is to integrate with the external
> libraries at the level of subgraphs instead of operators.
>     There are a few reasons in favor of the new integration:
>       *   Integration at the level of operators mixes the external library
> operators, such as MKLDNN, with MXNet operators and makes the
> implementation of the executor overcomplicated. We now have to deal with a
> lot of unexpected issues. (the executor needs to carefully deal with data
> format conversion between different operators; the operators of external
> libraries are subject to the same memory planning as other MXNet
> operators, etc).
>       *   External libraries need to reconstruct the computation graph for
> better performance (e.g., operator fusion). Integration at the level of
> subgraphs allows external libraries to perform arbitrary graph
> transformation and computation.
>     The proposal below provides both the design and the API for
> constructing subgraphs and executing subgraphs.
> Unified+integration+with+external+acceleration+libraries
>     Please let me know if you have any comments on this design and API.
>     Thanks,
>     Da
