mxnet-dev mailing list archives

From Da Zheng <zhengda1...@gmail.com>
Subject Re: Details regarding upcoming PR for runtime TensorRT integration
Date Tue, 12 Jun 2018 02:16:46 GMT
Hello Marek,

Thank you for your detailed design doc. My understanding is that the
current implementation is to convert an NNVM graph to an ONNX graph
and load the ONNX graph to TensorRT.
What is unclear to me is how an operator unsupported by TensorRT is
handled in this strategy. It seems you fall back to the MXNet
operators. Does your current solution partition the graph and load
subgraphs into TensorRT? If so, why do you need to convert a
partitioned subgraph to ONNX first? And if you convert the entire NNVM
graph to ONNX, could you describe in more detail how you fall back to
MXNet operators?

Thanks,
Da


On Mon, Jun 11, 2018 at 6:29 PM, Hagay Lupesko <lupesko@gmail.com> wrote:
> +1 for reviewing a design doc.
>
> Naveen - why do you see it sitting under ONNX? Isn't it a broader topic of GPU
> acceleration?
>
> Hagay
>
> On Mon, Jun 11, 2018, 12:56 Naveen Swamy <mnnaveen@gmail.com> wrote:
>
>> Please add your proposal under design proposals. Once the community has
>> reviewed it and there is consensus on the approach, we can create an
>> ONNX-MXNet subsection and move it there.
>>
>> On Mon, Jun 11, 2018 at 9:54 PM, Naveen Swamy <mnnaveen@gmail.com> wrote:
>>
>> > you have access now.
>> >
>> > On Mon, Jun 11, 2018 at 8:34 PM, Naveen Swamy <mnnaveen@gmail.com>
>> wrote:
>> >
>> >> I'll add in about an hour
>> >>
>> >> > On Jun 11, 2018, at 8:12 PM, Marco de Abreu <marco.g.abreu@googlemail.com> wrote:
>> >> >
>> >> > I don't know how to grant permission on Confluence. If somebody else
>> >> knows
>> >> > how to do so, please grant Marek the edit permissions.
>> >> >
>> >> > -Marco
>> >> >
>> >> >> On Mon, Jun 11, 2018 at 11:05 AM Marek Kolodziej <mkolod@gmail.com>
>> >> wrote:
>> >> >>
>> >> >> Hi Rajan,
>> >> >>
>> >> >> I wanted to share on Confluence, but it didn't allow me to create a new
>> >> >> document. If my e-mail address gets permissions to add new Confluence
>> >> >> pages, I'll transfer the contents to Confluence. Please keep me posted
>> >> >> when I get edit permissions.
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >> Marek
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Mon, Jun 11, 2018 at 11:02 AM singh.rajan28@gmail.com <
>> >> >> singh.rajan28@gmail.com> wrote:
>> >> >>
>> >> >>> Hi Marek,
>> >> >>>
>> >> >>> Thanks for sharing the document. It would be great if you could share it
>> >> >>> on the Confluence wiki or a Quip document. The formatting here makes it
>> >> >>> very difficult to read a long document.
>> >> >>>
>> >> >>> Appreciate the help.
>> >> >>>
>> >> >>> Thanks
>> >> >>> Rajan
>> >> >>>
>> >> >>>> On 2018/06/11 17:50:26, Marek Kolodziej <mkolod@gmail.com> wrote:
>> >> >>>> Hi everyone,
>> >> >>>>
>> >> >>>> This is a quick summary of NVIDIA's plans for open-sourcing an initial
>> >> >>>> integration of TensorRT as a runtime accelerator of MxNet (PR for
>> >> >>>> discussion coming in the next few days, ETA of the first draft of the
>> >> >>>> PR is this Friday or even earlier). Feedback is appreciated.
>> >> >>>>
>> >> >>>> Best,
>> >> >>>> Marek Kolodziej
>> >> >>>>
>> >> >>>> Need for runtime MxNet-TensorRT integration
>> >> >>>>
>> >> >>>> 1. TensorRT provides significant acceleration of model inference on
>> >> >>>> NVIDIA GPUs compared to running the full graph in MxNet using unfused
>> >> >>>> GPU operators. In addition to faster fp32 inference, TensorRT
>> >> >>>> optimizes fp16 inference, and is capable of int8 inference (provided
>> >> >>>> the quantization steps are performed). Besides increasing throughput,
>> >> >>>> TensorRT significantly reduces inference latency, especially for
>> >> >>>> small batches. See more here <https://developer.nvidia.com/tensorrt>.
>> >> >>>>
>> >> >>>> 2. Despite its benefits, using pre-trained models with TensorRT
>> >> >>>> typically requires some effort - either re-writing the model using
>> >> >>>> TensorRT's graph building APIs, or exporting a model to ONNX, followed
>> >> >>>> by an import step. Even if the import is simplified using ONNX, the
>> >> >>>> TensorRT user still needs to provide their own data pipeline, which
>> >> >>>> used to exist in the framework, but no longer does in a stand-alone
>> >> >>>> TensorRT deployment with a client application.
>> >> >>>>
>> >> >>>> 3. TensorRT is very performant, but does not have the full set of
>> >> >>>> MxNet's operators. While that could be addressed with TensorRT
>> >> >>>> plugins, it's much simpler to reuse already-existing MxNet operators.
>> >> >>>> Also, the user shouldn't have to know which operators are supported
>> >> >>>> by TensorRT and which ones aren't - runtime integration allows the
>> >> >>>> graph partitioner to extract subgraphs capable of running inside of
>> >> >>>> TensorRT, place each subgraph in a TensorRT operator in MxNet,
>> >> >>>> execute that operator as part of MxNet's graph execution, and handle
>> >> >>>> non-TensorRT-compatible nodes as regular MxNet operators remaining
>> >> >>>> after the TensorRT subgraph extraction and node substitution. The
>> >> >>>> goal is to accelerate inference without changing the user experience.
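
For illustration of the fallback idea in item 3 above, here is a minimal Python
sketch of greedy partitioning over a topologically sorted node list. The helper
name, the op list and the node representation are made up for this example (they
are not taken from the PR), and a real pass would also have to respect data
dependencies between nodes rather than just adjacency.

    # Hypothetical sketch, not the actual NNVM graph pass from the PR.
    TRT_COMPATIBLE_OPS = {"Convolution", "BatchNorm", "Activation",
                          "Pooling", "FullyConnected", "elemwise_add"}

    def partition(nodes, supported=TRT_COMPATIBLE_OPS):
        """nodes: list of (name, op_type) tuples in topological order."""
        segments = []  # list of (backend, [node names]) pairs
        for name, op_type in nodes:
            backend = "tensorrt" if op_type in supported else "mxnet"
            if segments and segments[-1][0] == backend:
                segments[-1][1].append(name)        # extend the current subgraph
            else:
                segments.append((backend, [name]))  # start a new subgraph
        return segments

    print(partition([("conv0", "Convolution"), ("bn0", "BatchNorm"),
                     ("custom0", "MyUnsupportedOp"), ("fc0", "FullyConnected")]))
    # [('tensorrt', ['conv0', 'bn0']), ('mxnet', ['custom0']), ('tensorrt', ['fc0'])]

In the design described above, each "tensorrt" segment would become a single
TensorRT node holding the engine, while the "mxnet" segments keep their original
operators.
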
>> >> >>>> Design considerations
>> >> >>>>
>> >> >>>> 1. Since TensorRT can only determine all possible optimizations once
>> >> >>>> the tensor shapes are known, it is imperative that all the shape
>> >> >>>> information be provided. This means that the best time to construct
>> >> >>>> the TensorRT graph is bind time. The coming PR can selectively apply
>> >> >>>> the TensorRT optimization for inference-only graphs at symbol bind
>> >> >>>> time. This is in fact consistent with the assumptions about TensorRT
>> >> >>>> made on the MxNet Wiki here
>> >> >>>> <https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries>.
>> >> >>>>
>> >> >>>> 2. Since, as mentioned in #1, TensorRT graph building needs shape
>> >> >>>> information only available at bind time, an important goal was not to
>> >> >>>> disrupt any existing APIs. Even though C++ permits default function
>> >> >>>> arguments, the Python bindings for symbol-related methods (e.g.
>> >> >>>> simple bind) are exposed via a C, not C++, API, wired on the Python
>> >> >>>> side using ctypes (e.g. see here
>> >> >>>> <https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/symbol/symbol.py#L1486:L1521>
>> >> >>>> for the simple bind integration). This precludes the addition of
>> >> >>>> extra arguments without causing breaking changes in the C API. Also,
>> >> >>>> adapting the Python code to such changes wouldn't be enough, since
>> >> >>>> all frontend languages use the C (not C++) API for the FFI.
>> >> >>>> Fortunately, C API changes could be avoided by simply letting the
>> >> >>>> user enable or disable the TensorRT pass using an environment
>> >> >>>> variable (MXNET_USE_TENSORRT=1 to enable). This also does not
>> >> >>>> diminish the flexibility of the integration, since the graph pass can
>> >> >>>> read the environment variable each time symbol binding is done, and
>> >> >>>> hence permits turning the graph passes on and off depending on need.
>> >> >>>> The ability to enable and disable the TensorRT pass at runtime also
>> >> >>>> makes unit testing easier.
>> >> >>>>
>> >> >>>> 3. TensorRT requires that the workspace size be provided at graph
>> >> >>>> construction time. This value constitutes the upper limit on the
>> >> >>>> amount of memory that TensorRT can use, and does not determine
>> >> >>>> immediate use. Since this amount can be hard for the user to know,
>> >> >>>> its limit should be set to a reasonable value that the user need not
>> >> >>>> concern themselves with. Given that TensorRT integration is applied
>> >> >>>> at bind time and that TensorRT engines wrapped in TensorRT nodes are
>> >> >>>> constructed during the graph pass rather than the memory allocation
>> >> >>>> pass, MxNet will only allocate the amount needed for the nodes
>> >> >>>> remaining after the TensorRT subgraphs have been extracted. This
>> >> >>>> means that no memory will be doubly allocated - first for the
>> >> >>>> complete MxNet subgraph and then for TensorRT. However, the question
>> >> >>>> remains whether the memory used per TensorRT engine should be a
>> >> >>>> configurable parameter, either as a method argument or an environment
>> >> >>>> variable, or whether TensorRT should be able to use the maximum
>> >> >>>> available GPU memory and then reserve only what it needs. I would
>> >> >>>> like to suggest the latter. Since the TensorRT subgraph will
>> >> >>>> typically use less memory than the same subgraph in MxNet (due to
>> >> >>>> more layer fusion), it's extremely unlikely that a model which runs
>> >> >>>> purely as an MxNet graph would fail with an out-of-memory error when
>> >> >>>> parts or most of the graph run inside TensorRT. Fewer knobs (in this
>> >> >>>> case, not giving the user the ability to tweak the maximum amount of
>> >> >>>> memory available to TensorRT) would simplify use.
>> >> >>>>
>> >> >>>> 4. TensorRT can accept graphs constructed using two main approaches:
>> >> >>>> (a) via the TensorRT graph API, (b) using ONNX. Approach (a) seems
>> >> >>>> simple on the surface - one traverses the NNVM graph, finds subgraphs
>> >> >>>> that TensorRT can execute, converts the subgraphs to TensorRT graphs,
>> >> >>>> and substitutes the subgraphs with TensorRT nodes, each of which
>> >> >>>> contains the TensorRT engine corresponding to the subgraph. However,
>> >> >>>> the approach taken by NVIDIA was to use ONNX as the IR. The reason
>> >> >>>> for this is twofold. First, ONNX is a very well-known IR, which is
>> >> >>>> supported by the entire deep learning software community. This
>> >> >>>> ensures that the design of the IR gets as much feedback as possible
>> >> >>>> as to whether the IR is feature-complete, and what the semantics are.
>> >> >>>> NVIDIA already maintains an ONNX-to-TensorRT converter (link
>> >> >>>> <https://github.com/onnx/onnx-tensorrt>), and will continue to do so.
>> >> >>>> Whatever changes may apply to the TensorRT APIs or the internal
>> >> >>>> features may be nicely hidden behind the well-established ONNX IR.
>> >> >>>> Second, ONNX is growing beyond being merely an IR. As it becomes more
>> >> >>>> of a standard, its adoption will be associated with other benefits,
>> >> >>>> such as the ability to verify standard compliance.
>> >> >>>>
>> >> >>>> 5. Despite the advantages of using the ONNX route described in #4,
>> >> >>>> there are some costs. The main one is the dependency on Protobuf.
>> >> >>>> This is a valid criticism on the surface; however, since the TensorRT
>> >> >>>> integration requires an opt-in at build time, adding one more
>> >> >>>> dependency is not a problem if it is not a mandatory dependency.
>> >> >>>> Moreover, the same Protobuf dependency already exists for the MxNet
>> >> >>>> ONNX importer, which is now part of the MxNet source tree (link
>> >> >>>> <https://github.com/apache/incubator-mxnet/blob/76417594e56a85ec0cc9412b9dd2c7e2ab581d8b/docs/api/python/contrib/onnx.md>),
>> >> >>>> rather than being located in a separate repository. Just like the use
>> >> >>>> of the ONNX importer is optional and requires ONNX (and hence also
>> >> >>>> Protobuf), the TensorRT build is optional (see the sketch after this
>> >> >>>> list).
>> >> >>>>
>> >> >>>> 6. The optional integration of TensorRT will be guarded using a
>> >> >>>> config.mk flag (USE_TENSORRT), which will function similarly to other
>> >> >>>> flags, such as USE_CUDA, USE_CUDNN, etc. Needless to say,
>> >> >>>> USE_TENSORRT will depend on CUDA and cuDNN.
>> >> >>>>
>> >> >>>> 7. In order to simplify evaluation of the TensorRT build, improve
>> >> >>>> usability, and make it easy to run unit tests, the PR will come with
>> >> >>>> a Dockerfile, which will allow anyone to build MxNet with TensorRT,
>> >> >>>> along with its dependencies, i.e. Protobuf and ONNX.
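
To make the two ONNX-related points concrete (items 4 and 5 above), here is a
short sketch that validates an exported graph against the ONNX spec and then
loads it with the in-tree importer documented at the contrib ONNX link above.
The file name is a placeholder, and this is only an illustration of the
dependency being discussed, not code from the upcoming PR.

    import onnx
    from mxnet.contrib import onnx as onnx_mxnet

    model = onnx.load("model.onnx")   # placeholder path to an exported model
    onnx.checker.check_model(model)   # raises if the graph violates the ONNX spec

    # Importing the same file into an MxNet symbol plus parameter dictionaries
    # already pulls in ONNX (and hence Protobuf), just like the optional
    # TensorRT build would.
    sym, arg_params, aux_params = onnx_mxnet.import_model("model.onnx")
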
>> >> >>>> APIs / user experience
>> >> >>>>
>> >> >>>> There is no change in the inference APIs, except for the need to set
>> >> >>>> the MXNET_USE_TENSORRT environment variable to 1. For example, in
>> >> >>>> Python, we can simply do:
>> >> >>>>
>> >> >>>>     os.environ["MXNET_USE_TENSORRT"] = "1"
>> >> >>>>
>> >> >>>> Note that for backward compatibility, if the environment variable is
>> >> >>>> not set, it will default to 0. Also, unlike some other environment
>> >> >>>> variables that are only checked during MxNet initialization, this one
>> >> >>>> gets checked every time graph binding happens. This typically happens
>> >> >>>> only once during the inference application's life cycle, but since
>> >> >>>> one can re-bind a symbol to, say, compare a TensorRT and a
>> >> >>>> non-TensorRT run, the check will happen during each bind/re-bind to
>> >> >>>> enable that. Since the TensorRT graph pass is enabled using an
>> >> >>>> environment variable, no break in the C++, C or any frontend language
>> >> >>>> API is needed.
>> >> >>>>
>> >> >>>> Note that there is one more change required - in calling simple bind.
>> >> >>>> This doesn't change the simple bind API, but how it's called relative
>> >> >>>> to the "usual" case, by using some of the arguments which are
>> >> >>>> optional. This has to do with the shared_buffer parameter. Before
>> >> >>>> explaining how the call changes, let's consider why it's necessary:
>> >> >>>>
>> >> >>>> 1. The TensorRT graph needs to be constructed during the simple bind
>> >> >>>> call, but before memory gets allocated for the non-TensorRT part of
>> >> >>>> the graph.
>> >> >>>> 2. TensorRT needs the weights, not just the shapes, to be provided
>> >> >>>> before the engine is constructed - it will store them inside the
>> >> >>>> ICudaEngine object. The engine will then be serialized inside the
>> >> >>>> NNVM TensorRT op, and deserialized when the graph executor takes
>> >> >>>> over. This means that the weights need to be provided to the simple
>> >> >>>> bind call to construct the TensorRT engine.
>> >> >>>> 3. The way to provide the weights is to hand them over to the simple
>> >> >>>> bind call via the "shared buffer" argument. The shared buffer weights
>> >> >>>> can be provided during the bind call and can be freed by the frontend
>> >> >>>> language once binding is complete (e.g. by exiting the relevant scope
>> >> >>>> in Python, or calling del).
>> >> >>>>
>> >> >>>> Since we need both arg_params (weights) and aux_params (e.g.
>> >> >>>> BatchNorm moments), we need to merge arg_params and aux_params into
>> >> >>>> one dictionary. Here's a Python example:
>> >> >>>>
>> >> >>>>     def merge_dicts(*dict_args):
>> >> >>>>         """Merge arg_params and aux_params to populate shared_buffer"""
>> >> >>>>         result = {}
>> >> >>>>         for dictionary in dict_args:
>> >> >>>>             result.update(dictionary)
>> >> >>>>         return result
>> >> >>>>
>> >> >>>> Now let's see a use example:
>> >> >>>>
>> >> >>>>     device = mx.gpu(0)
>> >> >>>>     sym, arg_params, aux_params = mx.model.load_checkpoint(model_name, num_epochs)
>> >> >>>>     executor = sym.simple_bind(ctx=device,
>> >> >>>>                                data=data_shape,
>> >> >>>>                                softmax_label=(batch_size,),
>> >> >>>>                                shared_buffer=merge_dicts(arg_params, aux_params),
>> >> >>>>                                grad_req='null',
>> >> >>>>                                force_rebind=True)
>> >> >>>>
>> >> >>>> Now we can simply update data in the executor's arg dict and run the
>> >> >>>> forward pass:
>> >> >>>>
>> >> >>>>     executor.arg_dict["data"][:] = my_data_batch
>> >> >>>>     executor.forward(is_train=False)
>> >> >>>>     predictions = executor.outputs[0].asnumpy()
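
Building on the example above (and reusing sym, device, data_shape, batch_size,
arg_params, aux_params, my_data_batch and merge_dicts from it; the tolerances
are arbitrary), one way to sanity-check the integration is to bind the same
symbol with the pass enabled and disabled and compare the outputs:

    import os
    import numpy as np

    def predict(use_tensorrt):
        # The graph pass re-reads the variable at every bind, so toggling it
        # per bind/re-bind switches between the two execution modes.
        os.environ["MXNET_USE_TENSORRT"] = "1" if use_tensorrt else "0"
        exe = sym.simple_bind(ctx=device, data=data_shape,
                              softmax_label=(batch_size,),
                              shared_buffer=merge_dicts(arg_params, aux_params),
                              grad_req='null', force_rebind=True)
        exe.arg_dict["data"][:] = my_data_batch
        exe.forward(is_train=False)
        return exe.outputs[0].asnumpy()

    # fp32 TensorRT results should closely track the pure-MxNet baseline.
    assert np.allclose(predict(True), predict(False), rtol=1e-3, atol=1e-4)
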
>> >> >>>> Limitations of initial integration and suggested future work
>> >> >>>>
>> >> >>>> 1. Since the new accelerator API proposal (link
>> >> >>>> <https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries>)
>> >> >>>> was only published a few days ago and the implementation is still on
>> >> >>>> an MxNet fork, the current TensorRT integration doesn't use that API
>> >> >>>> yet, but could be refactored in a future commit to use it. There is
>> >> >>>> nothing in the current design that would prevent making use of that
>> >> >>>> API in the near future.
>> >> >>>>
>> >> >>>> 2. Building the TensorRT engine takes a non-trivial amount of time,
>> >> >>>> because the compiler evaluates performance and the hardware on the
>> >> >>>> system before creating the fused layers on demand, and then needs to
>> >> >>>> actually compile them. For ResNet-50 this may be a few seconds, but
>> >> >>>> larger models also exist which may take longer. TensorRT comes with
>> >> >>>> the ability to serialize the TensorRT engine for a particular
>> >> >>>> hardware platform. This is called the serialization of a TensorRT
>> >> >>>> plan, which is the engine along with the ahead-of-time-compiled fused
>> >> >>>> kernels for a given GPU. The first PR of the TensorRT integration
>> >> >>>> will not provide for TensorRT plan caching, so using TensorRT might
>> >> >>>> have a small start-up cost, but for long-running inference processes,
>> >> >>>> this shouldn't be a problem. Caching the TensorRT plan will be
>> >> >>>> addressed in a future commit (see the timing sketch after this list).
>> >> >>>>
>> >> >>>> 3. As mentioned before, the reproducibility of the build will be
>> >> >>>> demonstrated using a Dockerfile that will provide an easy way to
>> >> >>>> evaluate the build. The Docker recipe was tested on Linux on x86_64,
>> >> >>>> but not on the other platforms supported by TensorRT (Linux on 64-bit
>> >> >>>> ARM (aarch64), Android on aarch64, QNX on aarch64). Supporting other
>> >> >>>> platforms, e.g. Linux on aarch64 (e.g. L4T, i.e. Linux for Tegra, on
>> >> >>>> the NVIDIA Jetson platform), is left for subsequent commits.
>> >> >>>>
>> >> >>>> 4. The current commit supports many, but not all, of TensorRT's
>> >> >>>> operators. For example, this integration can run CNNs such as VGG or
>> >> >>>> ResNet, but not necessarily everything that TensorRT can support.
>> >> >>>> More operators will be covered in future commits.
>> >> >>>>
>> >> >>>> 5. TensorRT supports plugins, which can be integrated into the graph
>> >> >>>> pass. However, this was not a priority, since the runtime TensorRT
>> >> >>>> integration can always fall back to existing MxNet operators.
>> >> >>>> Supporting plugins is possible, but will be added in future commits.
>> >> >>>>
>> >> >>>> 6. The upcoming PR will support fp16 and fp32, but not int8. Since
>> >> >>>> int8 support in MxNet is itself very new, figuring out calibration
>> >> >>>> and other details is left for a future commit.
>> >> >>>>
>> >> >>>> 7. TensorRT 4 is going to have a new feature called BYOM (bring your
>> >> >>>> own memory). This means that instead of telling TensorRT how much
>> >> >>>> memory it can use, the data/scratch space tensors can be provided by
>> >> >>>> MxNet, and can be re-used by MxNet when not running the forward pass.
>> >> >>>> The memory in permanent use will then be limited to TensorRT storing
>> >> >>>> weights. Support for this feature will be added in a future commit.
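
Regarding the start-up cost mentioned in limitation 2: since the engine is built
inside the bind call in this design, it is easy to measure directly. A rough
sketch, again reusing the names from the simple_bind example earlier in the
message (the iteration count and reporting format are arbitrary):

    import time

    start = time.time()
    executor = sym.simple_bind(ctx=device, data=data_shape,
                               softmax_label=(batch_size,),
                               shared_buffer=merge_dicts(arg_params, aux_params),
                               grad_req='null', force_rebind=True)
    print("bind time (includes TensorRT engine build): %.1f s" % (time.time() - start))

    executor.arg_dict["data"][:] = my_data_batch
    start = time.time()
    for _ in range(100):
        executor.forward(is_train=False)
        executor.outputs[0].wait_to_read()  # block until the async forward pass finishes
    print("steady-state latency: %.2f ms/batch" % ((time.time() - start) * 1000 / 100))
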
>> >> >>>>
>> >> >>>
>> >> >>
>> >>
>> >
>> >
>>
