mxnet-dev mailing list archives

From Marek Kolodziej <>
Subject Details regarding upcoming PR for runtime TensorRT integration
Date Mon, 11 Jun 2018 17:50:26 GMT
*Hi everyone,

This is a quick summary of NVIDIA’s plans for open-sourcing an initial
integration of TensorRT as a runtime accelerator of MxNet (PR for
discussion coming in the next few days, ETA of the first draft of the PR is
this Friday or even earlier). Feedback is appreciated.

Best,
Marek Kolodziej

Need for runtime MxNet-TensorRT integration

1. TensorRT provides
significant acceleration of model inference on NVIDIA GPUs compared to
running the full graph in MxNet using unfused GPU operators. In addition to
faster fp32 inference, TensorRT optimizes fp16 inference, and is capable of
int8 inference (provided the quantization steps are performed). Besides
increasing throughput, TensorRT significantly reduces inference latency,
especially for small batches. See more here
<>.

2. Despite its benefits, using
pre-trained models with TensorRT typically requires some effort - either
re-writing the model using TensorRT’s graph building APIs, or exporting a
model to ONNX, followed by an import step. Even if the import is simplified
using ONNX, the TensorRT user still needs to provide their own data
pipeline, which used to exist in the framework, but no longer does in a
stand-alone TensorRT deployment with a client application.

3. TensorRT is
very performant, but does not have the full set of MxNet’s operators. While
that could be addressed with TensorRT plugins, it’s much simpler to reuse
already-existing MxNet operators. Also, the user shouldn’t need to know
which operators are supported by TensorRT and which ones aren’t -
runtime integration allows the graph partitioner to extract subgraphs
capable of running inside of TensorRT, place each subgraph in a TensorRT
operator in MxNet, execute that operator as part of MxNet’s graph
execution, and handle non-TensorRT-compatible nodes as regular MxNet
operators remaining after the TensorRT subgraph extraction and node
substitution. The goal is to accelerate inference without changing user
experience.

Design considerations

1. Since TensorRT can only determine all
possible optimizations once the tensor shapes are known, it is imperative
that all the shape information be provided. This means that the best time
to construct the TensorRT graph is bind time. The coming PR can selectively
apply the TensorRT optimization for inference-only graphs at symbol bind
time. This is in fact consistent with the assumptions about TensorRT made
on the MxNet Wiki here.

2. Since, as mentioned in #1, TensorRT graph building needs shape
information only available at bind time, an important goal was not to
disrupt any existing APIs. Even though C++ permits default function
arguments, the Python bindings for symbol-related methods (e.g. simple
bind) are exposed via a C, not C++, API, wired on the Python side using
ctypes (e.g. see here
for the simple bind integration). This precludes the addition of extra
arguments without causing breaking changes in the C API. Also, adapting the
Python code to such changes wouldn’t be enough, since all frontend
languages use the C (not C++) API for the FFI. Fortunately, C API changes
could be avoided, by simply letting the user enable or disable the TensorRT
pass using an environment variable (MXNET_USE_TENSORRT=1 to enable). This also
does not diminish the flexibility of the integration, since the graph pass
can read the environment variable each time symbol binding is done, and
hence permits turning the graph passes on and off, depending on need. The
ability to enable and disable the TensorRT pass at runtime also makes unit
testing easier.

3. TensorRT requires that the workspace size be provided at
graph construction time. This value constitutes the upper limit on the
amount of memory that TensorRT can use, and does not determine immediate
use. Since this amount can be hard for the user to know, its limit should
be set to a reasonable value that the user need not concern themselves
with. Given that TensorRT integration is applied at bind time and that
TensorRT engines wrapped in TensorRT nodes are constructed during the graph
pass rather than the memory allocation pass, MxNet will only allocate the
amount needed for the nodes remaining after the TensorRT subgraphs have
been extracted. This means that no memory will be doubly allocated - first
for the complete MxNet subgraph and then for TensorRT. However, the
question remains whether the memory used per TensorRT engine should be a
configurable parameter, either as a method argument or an environment
variable, or whether TensorRT should be able to use the maximum available
GPU memory and then reserve only what it needs. I would like to suggest the
latter. Since the TensorRT subgraph will typically use less memory than the
same subgraph in MxNet (due to more layer fusion), it’s extremely unlikely
that a model which runs purely as an MxNet graph would fail with an out of
memory error when parts or most of the graph run inside TensorRT. Fewer
knobs (in this case, not giving the user the ability to tweak the maximum
amount of memory available to TensorRT) would simplify use.

4. TensorRT can
accept graphs constructed using two main approaches: (a) via the TensorRT
graph API, (b) using ONNX. Approach (a) seems simple on the surface - one
traverses the NNVM graph, finds subgraphs that TensorRT can execute,
converts the subgraphs to TensorRT graphs, and substitutes the subgraphs
with TensorRT nodes, each of which contains the TensorRT engine
corresponding to the subgraph. However, the approach taken by NVIDIA was to
use ONNX as the IR. The reason for this is twofold. First, ONNX is a very
well-known IR, which is supported by the entire deep learning software
community. This ensures that the design of the IR gets as much feedback as
possible as to whether the IR is feature complete, and what the semantics
are. NVIDIA already maintains an ONNX-to-TensorRT converter (link
<>), and will continue to do so.
Whatever changes may apply to the TensorRT APIs or the internal
features can be hidden behind the well-established ONNX IR. Second,
ONNX is growing beyond being merely an IR. As it becomes more of a
standard, its adoption will be associated with other benefits, such as the
ability to verify standard compliance.

5. Despite the advantages of using
the ONNX route described in #4, there are some costs. The main one is the
dependency on Protobuf. This is a valid criticism on the surface; however,
since the TensorRT integration requires an opt-in during build time, adding
one more dependency is not a problem if it is not a mandatory dependency.
Moreover, the same Protobuf dependency already exists for the MxNet ONNX
importer, which is now part of the MxNet source tree (link)
rather than being located in a separate repository. Just like the use of
the ONNX importer is optional and requires ONNX (and hence also Protobuf),
the TensorRT build is optional.

6. The optional integration of TensorRT
will be guarded using a <> flag (USE_TENSORRT),
which will function similarly to other flags, such as USE_CUDA, USE_CUDNN,
etc. Needless to say, USE_TENSORRT will depend on CUDA and cuDNN.

7. In order to simplify evaluating the TensorRT build and its usability,
and to run unit tests, the PR will come with a Dockerfile, which will allow
anyone to build MxNet with TensorRT, along with its dependencies, i.e.
Protobuf and ONNX.

APIs / user experience

There is no change in the inference APIs, except for the need to set the
MXNET_USE_TENSORRT environment variable to 1. For example, in Python, we
can simply do:

    import os
    os.environ["MXNET_USE_TENSORRT"] = "1"

Note that for backward
compatibility, if the environment variable is not set, it will default to
0. Also, unlike some other environment variables that are only checked
during MxNet initialization, this one gets checked every time graph binding
happens. This typically happens only once during the inference
application’s life cycle, but since one can re-bind a symbol to, say, compare
a TensorRT and a non-TensorRT run, the check will happen during each
bind/re-bind to enable that. Since the TensorRT graph pass is enabled using
an environment variable, no break in the C++, C or any frontend language
API is needed.

Note that there is one more change required - in calling simple bind. This
doesn’t change the simple bind API, only how it’s called relative to the
“usual” case, by using some of the optional arguments. This has to do with
the shared_buffer parameter. Before explaining how the call changes, let’s
consider why it’s necessary:

1. The TensorRT graph needs to be constructed during the simple bind call,
but before memory gets allocated for the non-TensorRT part of the graph.

2. TensorRT needs the weights, not just the shapes, to be provided before the
engine is constructed - it will store them inside the ICudaEngine object.
The engine will then be serialized inside the NNVM TensorRT op, and
deserialized when the graph executor takes over. This means that the
weights need to be provided to the simple bind call to construct the
TensorRT engine.

3. The way to provide the weights is to hand them over to
the simple bind call via the “shared buffer” argument. The shared buffer
weights can be provided during the bind call and can be freed by the
frontend language once binding is complete (e.g. by exiting the relevant
scope in Python, or calling del).

Since we need both arg_params (weights) and aux_params (e.g. BatchNorm
moments), we need to merge arg_params and aux_params into one dictionary.
Here’s a Python example:

    def merge_dicts(*dict_args):
        """Merge arg_params and aux_params to populate shared_buffer"""
        result = {}
        for dictionary in dict_args:
            result.update(dictionary)
        return result

Now let’s see a use example:

    import mxnet as mx

    device = mx.gpu(0)
    sym, arg_params, aux_params = mx.model.load_checkpoint(model_name,
                                                           num_epochs)
    executor = sym.simple_bind(ctx=device,
                               data=data_shape,
                               softmax_label=(batch_size,),
                               shared_buffer=merge_dicts(arg_params,
                                                         aux_params),
                               grad_req='null',
                               force_rebind=True)

Now we can simply update data in the executor’s arg dict and run the
forward pass:

    executor.arg_dict["data"][:] = my_data_batch
    executor.forward(is_train=False)
    predictions = executor.outputs[0].asnumpy()

Limitations of initial integration and suggested future work

1. Since the new accelerator API proposal (link)
was only published a few days ago and the implementation is still on an
MxNet fork, the current TensorRT integration doesn’t use that API yet, but
could be refactored in a future commit to use it. There is nothing in the
current design that would prevent making use of that API in the near
future.

2. Building the TensorRT engine takes a non-trivial amount of time,
because the compiler evaluates performance and the hardware on the system
before creating the fused layers on demand, and then needs to actually
compile them. For ResNet-50 this may be a few seconds, but larger models
also exist which may take longer. TensorRT comes with the ability to
serialize the TensorRT engine for a particular hardware platform. This is
called the serialization of a TensorRT plan, which is the engine along with
the ahead-of-time-compiled fused kernels for a given GPU. The first PR of
the TensorRT integration will not provide for TensorRT plan caching, so
using TensorRT might have a small start-up cost, but for long-running
inference processes, this shouldn’t be a problem. Caching the TensorRT plan
will be addressed in a future commit.

3. As mentioned before, the
reproducibility of the build will be demonstrated using a Dockerfile that
will provide an easy way to evaluate the build. The Docker recipe was
tested on Linux on x86_64, but not other platforms supported by TensorRT
(Linux on 64-bit ARM (aarch64), Android on aarch64, QNX on aarch64).
Supporting other platforms, e.g. Linux on aarch64 (e.g. L4T, i.e. Linux for
Tegra, on the NVIDIA Jetson platform) is left for subsequent commits.

4. The current commit supports many, but not all, of TensorRT’s operators. For
example, this integration can run CNNs such as VGG, or ResNet, but not
necessarily everything that TensorRT can support. More operators will be
covered in future commits.

5. TensorRT supports plugins, which can be
integrated into the graph pass. However, this was not a priority since the
runtime TensorRT integration can always fall back to existing MxNet
operators. Supporting plugins is possible, but will be added in future
commits.

6. The upcoming PR will support fp16 and fp32, but not int8. Since
int8 support in MxNet is itself very new, figuring out calibration and
other details is left for a future commit.

7. TensorRT 4 is going to have a
new feature called BYOM (bring your own memory). This means that instead of
telling TensorRT how much memory it can use, the data/scratch space tensors
can be provided by MxNet, and can be re-used by MxNet when not running the
forward pass. The memory in permanent use will then be limited to TensorRT
storing weights. Support for this feature will be added in a future commit.*
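
As a minimal sketch of the per-bind toggle described under "APIs / user
experience": because MXNET_USE_TENSORRT is read each time a symbol is bound,
a small helper can scope the variable to a single bind call. The helper
names tensorrt_enabled and bind_both_ways, and the bind_fn callable, are
hypothetical and introduced here for illustration only; they are not part of
MxNet or of the upcoming PR.

```python
import os
from contextlib import contextmanager

ENV_VAR = "MXNET_USE_TENSORRT"  # variable name taken from the text above


@contextmanager
def tensorrt_enabled(enabled=True):
    """Temporarily set MXNET_USE_TENSORRT and restore the old value on exit.

    Hypothetical helper (not an MxNet API): it only touches the process
    environment, which the graph pass is described as reading at every bind.
    """
    previous = os.environ.get(ENV_VAR)
    os.environ[ENV_VAR] = "1" if enabled else "0"
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = previous


def bind_both_ways(bind_fn):
    """Call a user-supplied bind function with the pass on, then off.

    bind_fn is an assumed zero-argument callable that wraps a call such as
    sym.simple_bind(...); it is not something MxNet provides.
    """
    with tensorrt_enabled(True):
        with_trt = bind_fn()
    with tensorrt_enabled(False):
        without_trt = bind_fn()
    return with_trt, without_trt
```

For example, passing a lambda that wraps the simple_bind call shown earlier
would return two executors, one bound with the TensorRT pass and one
without, whose outputs can then be compared.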

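The subgraph extraction described in item 3 of the first section can be
illustrated with a toy sketch. It assumes the graph is a simple linear chain
of operator names, and it uses a made-up set of TensorRT-supported operators
(SUPPORTED_OPS is hypothetical, not the real coverage list). The actual graph
pass works on NNVM graphs, which are general DAGs, so this shows only the
grouping idea and not the implementation in the PR.

```python
from itertools import groupby

# Hypothetical operator coverage, for illustration only.
SUPPORTED_OPS = {"Convolution", "BatchNorm", "Activation", "Pooling"}


def partition_chain(ops, supported=SUPPORTED_OPS):
    """Group a linear chain of op names into TensorRT and MxNet segments.

    Returns a list of (backend, [ops]) tuples. Each maximal run of supported
    ops becomes one "TensorRT" segment; each maximal run of unsupported ops
    stays in an "MxNet" segment.
    """
    segments = []
    for is_supported, run in groupby(ops, key=lambda op: op in supported):
        backend = "TensorRT" if is_supported else "MxNet"
        segments.append((backend, list(run)))
    return segments


chain = ["Convolution", "BatchNorm", "Activation", "CustomOp", "Pooling"]
for backend, segment in partition_chain(chain):
    print(backend, segment)
```

Here the first three operators are grouped into one TensorRT segment, the
unsupported CustomOp stays with MxNet, and Pooling forms a second TensorRT
segment, mirroring the fallback behaviour described above.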