mxnet-dev mailing list archives

From Marek Kolodziej <mko...@gmail.com>
Subject Re: Details regarding upcoming PR for runtime TensorRT integration
Date Tue, 26 Jun 2018 18:19:57 GMT
Hi everyone,

Sorry for the delayed reply to this thread.

First of all, the updated documentation is now on Confluence:
https://cwiki.apache.org/confluence/display/MXNET/Runtime+Integration+with+TensorRT

Da, the details of partitioning the graph into TensorRT-compatible
subgraphs and incompatible nodes, which remain as native MxNet operators,
are provided on the above Wiki page. The reason why the entire graph is not
converted to ONNX first is that ONNX also supports a smaller subset of
operators than MxNet does. So, the partitioning is done on the NNVM graph.
In order for a subgraph to be extracted, it has to be both TensorRT
compatible and ONNX compatible. Since ONNX's operator support is a strict
superset of TensorRT's, that's not a problem. The only exception to this is
if a user has TensorRT plugins. However, this could eventually be handled
using ONNX extensions, by storing metadata about nodes that are only known
to TensorRT via plugin registration. This could be done similarly to
PyTorch's ATen extension to ONNX.

Also, the general goal of TensorRT plugins is not ease of use, but
extensibility of deployment. Typically, a DL framework such as MxNet is
used for research and model training. TensorRT runtime integration makes it
easier to optimize inference performance without limiting the generality of
the model on which to run inference, because TensorRT isn't as general as
the frameworks in terms of operator support, and it also lacks the
framework's data pipeline. If the user needs to deploy a hyper-optimized
model that contains operators TensorRT does not support, they can port
those operators to TensorRT plugins, e.g. the way TensorRT plugins are used
for non-maximum suppression in, say, SSD. This lets the user deploy with
TensorRT alone, at the cost of rebuilding the data pipeline and adding
plugins for operators that TensorRT doesn't come with out of the box. Given
this summary, it's generally unlikely that a user would care about plugins
for in-framework inference, but they would if they need to deploy TensorRT
standalone.
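To make that selection rule concrete, here is a minimal sketch in Python.
The operator sets below are illustrative placeholders, not the actual lists
used in the PR:

    # Illustrative only: placeholder operator sets, not the PR's actual tables.
    ONNX_CONVERTIBLE = {"Convolution", "BatchNorm", "Activation", "Pooling",
                        "FullyConnected", "Concat", "elemwise_add", "Flatten"}
    TENSORRT_SUPPORTED = {"Convolution", "BatchNorm", "Activation", "Pooling",
                          "FullyConnected", "Concat", "elemwise_add"}

    def extractable(op_name):
        # A node can be pulled into a TensorRT subgraph only if ONNX can express
        # it (the subgraph travels through ONNX) and TensorRT can execute it;
        # every other node stays behind as a regular MxNet operator.
        return op_name in ONNX_CONVERTIBLE and op_name in TENSORRT_SUPPORTED

Because the TensorRT set is (plugins aside) a subset of the ONNX set, the
second membership test is the one that effectively decides the partition.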

I hope this helps!

Marek


On Mon, Jun 11, 2018 at 7:17 PM Da Zheng <zhengda1936@gmail.com> wrote:

> Hello Marek,
>
> Thank you for your detailed design doc. My understanding is that the
> current implementation is to convert an NNVM graph to an ONNX graph
> and load the ONNX graph to TensorRT.
> What is unclear to me is how an operator unsupported by TensorRT is
> handled in this strategy. It seems you fall back to the MXNet
> operators. So your current solution partitions a graph and loads
> subgraphs into TensorRT? If so, why do you need to convert a partitioned
> subgraph to ONNX first? If you convert the entire NNVM graph to ONNX,
> could you describe in more detail how to fall back to MXNet
> operators?
>
> Thanks,
> Da
>
>
> On Mon, Jun 11, 2018 at 6:29 PM, Hagay Lupesko <lupesko@gmail.com> wrote:
> > +1 for reviewing a design doc.
> >
> > Naveen - why do you see it sitting under ONNX? Isn't it a broader topic of
> > GPU acceleration?
> >
> > Hagay
> >
> > On Mon, Jun 11, 2018, 12:56 Naveen Swamy <mnnaveen@gmail.com> wrote:
> >
> >> Please add your proposal under design proposals. Once the community has
> >> reviewed it and there is consensus on the approach, we can create an
> >> ONNX-MXNet sub-section and move it there.
> >>
> >> On Mon, Jun 11, 2018 at 9:54 PM, Naveen Swamy <mnnaveen@gmail.com>
> wrote:
> >>
> >> > you have access now.
> >> >
> >> > On Mon, Jun 11, 2018 at 8:34 PM, Naveen Swamy <mnnaveen@gmail.com>
> >> wrote:
> >> >
> >> >> I'll add in about an hour
> >> >>
> >> >> > On Jun 11, 2018, at 8:12 PM, Marco de Abreu <marco.g.abreu@googlemail.com> wrote:
> >> >> >
> >> >> > I don't know how to grant permission on Confluence. If somebody else knows
> >> >> > how to do so, please grant Marek the edit permissions.
> >> >> >
> >> >> > -Marco
> >> >> >
> >> >> >> On Mon, Jun 11, 2018 at 11:05 AM Marek Kolodziej <mkolod@gmail.com> wrote:
> >> >> >>
> >> >> >> Hi Rajan,
> >> >> >>
> >> >> >> I wanted to share on Confluence, but it didn't allow me to create a new
> >> >> >> document. If my e-mail address gets permissions to add new Confluence
> >> >> >> pages, I'll transfer the contents to Confluence. Please keep me posted
> >> >> >> when I get edit permissions.
> >> >> >>
> >> >> >> Thanks!
> >> >> >>
> >> >> >> Marek
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Mon, Jun 11, 2018 at 11:02 AM singh.rajan28@gmail.com <
> >> >> >> singh.rajan28@gmail.com> wrote:
> >> >> >>
> >> >> >>> Hi Marek,
> >> >> >>>
> >> >> >>> Thanks for sharing the document. It would be great if you could share it
> >> >> >>> on the Confluence wiki or a Quip document. The formatting here makes it
> >> >> >>> very difficult to read a long document.
> >> >> >>>
> >> >> >>> Appreciate the help.
> >> >> >>>
> >> >> >>> Thanks
> >> >> >>> Rajan
> >> >> >>>
> >> >> >>>> On 2018/06/11 17:50:26, Marek Kolodziej <mkolod@gmail.com> wrote:
> >> >> >>>>
> >> >> >>>> Hi everyone,
> >> >> >>>>
> >> >> >>>> This is a quick summary of NVIDIA's plans for open-sourcing an initial integration of TensorRT as a runtime accelerator of MxNet (PR for discussion coming in the next few days; the ETA of the first draft of the PR is this Friday or even earlier). Feedback is appreciated.
> >> >> >>>>
> >> >> >>>> Best,
> >> >> >>>> Marek Kolodziej
> >> >> >>>>
> >> >> >>>> Need for runtime MxNet-TensorRT integration
> >> >> >>>>
> >> >> >>>> 1. TensorRT provides significant acceleration of model inference on NVIDIA GPUs compared to running the full graph in MxNet using unfused GPU operators. In addition to faster fp32 inference, TensorRT optimizes fp16 inference, and is capable of int8 inference (provided the quantization steps are performed). Besides increasing throughput, TensorRT significantly reduces inference latency, especially for small batches. See more here <https://developer.nvidia.com/tensorrt>.
> >> >> >>>>
> >> >> >>>> 2. Despite its benefits, using pre-trained models with TensorRT typically requires some effort - either re-writing the model using TensorRT's graph building APIs, or exporting a model to ONNX, followed by an import step. Even if the import is simplified using ONNX, the TensorRT user still needs to provide their own data pipeline, which used to exist in the framework, but no longer does in a stand-alone TensorRT deployment with a client application.
> >> >> >>>>
> >> >> >>>> 3. TensorRT is very performant, but does not have the full set of MxNet's operators. While that could be addressed with TensorRT plugins, it's much simpler to reuse already-existing MxNet operators. Also, the user shouldn't have to know which operators are supported by TensorRT and which ones aren't - runtime integration allows the graph partitioner to extract subgraphs capable of running inside of TensorRT, place the subgraph in a TensorRT operator in MxNet, execute that operator as part of MxNet's graph execution, and handle non-TensorRT-compatible nodes as regular MxNet operators remaining after the TensorRT subgraph extraction and node substitution. The goal is to accelerate inference without changing the user experience.
> >> >> >>>>
> >> >> >>>> Design considerations
> >> >> >>>>
> >> >> >>>> 1. Since TensorRT can only determine all possible optimizations once the tensor shapes are known, it is imperative that all the shape information be provided. This means that the best time to construct the TensorRT graph is bind time. The coming PR can selectively apply the TensorRT optimization for inference-only graphs at symbol bind time. This is in fact consistent with the assumptions about TensorRT made on the MxNet Wiki here <https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries>.
> >> >> >>>>
> >> >> >>>> 2. Since, as mentioned in #1, TensorRT graph building needs shape information only available at bind time, an important goal was not to disrupt any existing APIs. Even though C++ permits default function arguments, the Python bindings for symbol-related methods (e.g. simple bind) are exposed via a C, not C++, API, wired on the Python side using ctypes (e.g. see here <https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/symbol/symbol.py#L1486:L1521> for the simple bind integration). This precludes the addition of extra arguments without causing breaking changes in the C API. Also, adapting the Python code to such changes wouldn't be enough, since all frontend languages use the C (not C++) API for the FFI. Fortunately, C API changes could be avoided by simply letting the user enable or disable the TensorRT pass using an environment variable (USE_TENSORRT=1 to enable). This also does not diminish the flexibility of the integration, since the graph pass can read the environment variable each time symbol binding is done, and hence permits turning the graph passes on and off, depending on need. The ability to enable and disable the TensorRT pass at runtime also makes unit testing easier.
> >> >> >>>>
> >> >> >>>> 3. TensorRT requires that the workspace size be provided at graph construction time. This value constitutes the upper limit on the amount of memory that TensorRT can use, and does not determine immediate use. Since this amount can be hard for the user to know, its limit should be set to a reasonable value that the user need not concern themselves with. Given that TensorRT integration is applied at bind time and that TensorRT engines wrapped in TensorRT nodes are constructed during the graph pass rather than the memory allocation pass, MxNet will only allocate the amount needed for the nodes remaining after the TensorRT subgraphs have been extracted. This means that no memory will be doubly allocated - first for the complete MxNet subgraph and then for TensorRT. However, the question remains whether the memory used per TensorRT engine should be a configurable parameter, either as a method argument or an environment variable, or whether TensorRT should be able to use the maximum available GPU memory and then reserve only what it needs. I would like to suggest the latter. Since the TensorRT subgraph will typically use less memory than the same subgraph in MxNet (due to more layer fusion), it's extremely unlikely that a model which runs purely as an MxNet graph would fail with an out-of-memory error when parts or most of the graph run inside TensorRT. Fewer knobs (in this case, not giving the user the ability to tweak the maximum amount of memory available to TensorRT) would simplify use.
> >> >> >>>>
> >> >> >>>> 4. TensorRT can accept graphs constructed using two main approaches: (a) via the TensorRT graph API, (b) using ONNX. Approach (a) seems simple on the surface - one traverses the NNVM graph, finds subgraphs that TensorRT can execute, converts the subgraphs to TensorRT graphs, and substitutes the subgraphs with TensorRT nodes, each of which contains the TensorRT engine corresponding to the subgraph. However, the approach taken by NVIDIA was to use ONNX as the IR. The reason for this is twofold. First, ONNX is a very well-known IR, which is supported by the entire deep learning software community. This ensures that the design of the IR gets as much feedback as possible as to whether the IR is feature complete, and what the semantics are. NVIDIA already maintains an ONNX-to-TensorRT converter (link <https://github.com/onnx/onnx-tensorrt>; a standalone usage sketch follows after this list), and will continue to do so. Whatever changes may apply to the TensorRT APIs or the internal features may be nicely hidden behind the well-established ONNX IR. Second, ONNX is growing beyond being merely an IR. As it becomes more of a standard, its adoption will be associated with other benefits, such as the ability to verify standard compliance.
> >> >> >>>>
> >> >> >>>> 5. Despite the advantages of using the ONNX route described in #4, there are some costs. The main one is the dependency on Protobuf. This is a valid criticism on the surface; however, since the TensorRT integration requires an opt-in at build time, adding one more dependency is not a problem if it is not a mandatory dependency. Moreover, the same Protobuf dependency already exists for the MxNet ONNX importer, which is now part of the MxNet source tree (link <https://github.com/apache/incubator-mxnet/blob/76417594e56a85ec0cc9412b9dd2c7e2ab581d8b/docs/api/python/contrib/onnx.md>), rather than being located in a separate repository. Just like the use of the ONNX importer is optional and requires ONNX (and hence also Protobuf), the TensorRT build is optional.
> >> >> >>>>
> >> >> >>>> 6. The optional integration of TensorRT will be guarded using a config.mk flag (USE_TENSORRT), which will function similarly to other flags, such as USE_CUDA, USE_CUDNN, etc. Needless to say, USE_TENSORRT will depend on CUDA and cuDNN.
> >> >> >>>>
> >> >> >>>> 7. In order to simplify evaluation of the TensorRT build and its usability, and to run unit tests, the PR will come with a Dockerfile, which will allow anyone to build MxNet with TensorRT, along with its dependencies, i.e. Protobuf and ONNX.
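For reference, the standalone ONNX-to-TensorRT converter linked in item 4 above
can also be driven directly from Python. A rough sketch along the lines of the
usage documented in the onnx-tensorrt repository (exact function names and
arguments may differ between versions, and this path is separate from the MxNet
runtime integration described here):

    import numpy as np
    import onnx
    import onnx_tensorrt.backend as backend  # from https://github.com/onnx/onnx-tensorrt

    # Build a TensorRT engine directly from an ONNX model file (the path is an example).
    model = onnx.load("model.onnx")
    engine = backend.prepare(model, device="CUDA:0")

    # Run inference on a dummy batch; the input shape is model-specific.
    data = np.random.random((1, 3, 224, 224)).astype(np.float32)
    output = engine.run(data)[0]
    print(output.shape)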
> >> >> >>>>
> >> >> >>>> APIs / user experience
> >> >> >>>>
> >> >> >>>> There is no change in the inference APIs, except for the need to set the MXNET_USE_TENSORRT environment variable to 1. For example, in Python, we can simply do:
> >> >> >>>>
> >> >> >>>>     os.environ["MXNET_USE_TENSORRT"] = "1"
> >> >> >>>>
> >> >> >>>> Note that for backward compatibility, if the environment variable is not set, it will default to 0. Also, unlike some other environment variables that are only checked during MxNet initialization, this one gets checked every time graph binding happens. This typically happens only once during the inference application's life cycle, but since one can re-bind a symbol to, say, compare a TensorRT and a non-TensorRT run, the check will happen during each bind/re-bind to enable that. Since the TensorRT graph pass is enabled using an environment variable, no break in the C++, C or any frontend language API is needed.
> >> >> >>>>
> >> >> >>>> Note that there is one more change required - in calling simple bind. This doesn't change the simple bind API, but how it's called relative to the "usual" case, by using some of the arguments which are optional. This has to do with the shared_buffer parameter. Before explaining how the call changes, let's consider why it's necessary:
> >> >> >>>>
> >> >> >>>> 1. The TensorRT graph needs to be constructed during the simple bind call, but before memory gets allocated for the non-TensorRT part of the graph.
> >> >> >>>> 2. TensorRT needs the weights, not just the shapes, to be provided before the engine is constructed - it will store them inside the ICudaEngine object. The engine will then be serialized inside the NNVM TensorRT op, and deserialized when the graph executor takes over. This means that the weights need to be provided to the simple bind call to construct the TensorRT engine.
> >> >> >>>> 3. The way to provide the weights is to hand them over to the simple bind call via the "shared buffer" argument. The shared buffer weights can be provided during the bind call and can be freed by the frontend language once binding is complete (e.g. by exiting the relevant scope in Python, or calling del).
> >> >> >>>>
> >> >> >>>> Since we need both arg_params (weights) and aux_params (e.g. BatchNorm moments), we need to merge arg_params and aux_params into one dictionary. Here's a Python example:
> >> >> >>>>
> >> >> >>>>     def merge_dicts(*dict_args):
> >> >> >>>>         """Merge arg_params and aux_params to populate shared_buffer"""
> >> >> >>>>         result = {}
> >> >> >>>>         for dictionary in dict_args:
> >> >> >>>>             result.update(dictionary)
> >> >> >>>>         return result
> >> >> >>>>
> >> >> >>>> Now let's see a usage example:
> >> >> >>>>
> >> >> >>>>     device = mx.gpu(0)
> >> >> >>>>     sym, arg_params, aux_params = mx.model.load_checkpoint(model_name, num_epochs)
> >> >> >>>>     executor = sym.simple_bind(ctx=device,
> >> >> >>>>                                data=data_shape,
> >> >> >>>>                                softmax_label=(batch_size,),
> >> >> >>>>                                shared_buffer=merge_dicts(arg_params, aux_params),
> >> >> >>>>                                grad_req='null',
> >> >> >>>>                                force_rebind=True)
> >> >> >>>>
> >> >> >>>> Now we can simply update data in the executor's arg dict and run the forward pass:
> >> >> >>>>
> >> >> >>>>     executor.arg_dict["data"][:] = my_data_batch
> >> >> >>>>     executor.forward(is_train=False)
> >> >> >>>>     predictions = executor.outputs[0].asnumpy()
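To illustrate the re-bind scenario mentioned above (checking a TensorRT-enabled
run against a plain MxNet run of the same symbol), here is a sketch building on
the example above; the random placeholder batch and the tolerances are
arbitrary, and sym, arg_params, aux_params, data_shape, batch_size and
merge_dicts are assumed to be defined as in that example:

    import os
    import numpy as np
    import mxnet as mx

    def bind_and_run(use_trt, batch):
        # MXNET_USE_TENSORRT is read at every bind, so flipping it and re-binding
        # switches the TensorRT graph pass on or off for the same symbol.
        os.environ["MXNET_USE_TENSORRT"] = "1" if use_trt else "0"
        executor = sym.simple_bind(ctx=mx.gpu(0), data=data_shape,
                                   softmax_label=(batch_size,),
                                   shared_buffer=merge_dicts(arg_params, aux_params),
                                   grad_req='null', force_rebind=True)
        executor.arg_dict["data"][:] = batch
        executor.forward(is_train=False)
        return executor.outputs[0].asnumpy()

    batch = np.random.random(data_shape).astype(np.float32)
    np.testing.assert_allclose(bind_and_run(True, batch), bind_and_run(False, batch),
                               rtol=1e-3, atol=1e-3)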
> >> >> >>>>
> >> >> >>>> Limitations of initial integration and suggested future work
> >> >> >>>>
> >> >> >>>> 1. Since the new accelerator API proposal (link <https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries>) was only published a few days ago and the implementation is still on an MxNet fork, the current TensorRT integration doesn't use that API yet, but could be refactored in a future commit to use it. There is nothing in the current design that would prevent making use of that API in the near future.
> >> >> >>>>
> >> >> >>>> 2. Building the TensorRT engine takes a non-trivial amount of time, because the compiler evaluates performance and the hardware on the system before creating the fused layers on demand, and then needs to actually compile them. For ResNet-50 this may be a few seconds, but larger models also exist which may take longer. TensorRT comes with the ability to serialize the TensorRT engine for a particular hardware platform. This is called the serialization of a TensorRT plan, which is the engine along with the ahead-of-time-compiled fused kernels for a given GPU. The first PR of the TensorRT integration will not provide for TensorRT plan caching, so using TensorRT might have a small start-up cost, but for long-running inference processes this shouldn't be a problem. Caching the TensorRT plan will be addressed in a future commit (a sketch of what plan caching amounts to follows after this list).
> >> >> >>>>
> >> >> >>>> 3. As mentioned before, the reproducibility of the build will be demonstrated using a Dockerfile that will provide an easy way to evaluate the build. The Docker recipe was tested on Linux on x86_64, but not on the other platforms supported by TensorRT (Linux on 64-bit ARM (aarch64), Android on aarch64, QNX on aarch64). Supporting other platforms, e.g. Linux on aarch64 (e.g. L4T, i.e. Linux for Tegra, on the NVIDIA Jetson platform), is left for subsequent commits.
> >> >> >>>>
> >> >> >>>> 4. The current commit supports many, but not all, of TensorRT's operators. For example, this integration can run CNNs such as VGG or ResNet, but not necessarily everything that TensorRT can support. More operators will be covered in future commits.
> >> >> >>>>
> >> >> >>>> 5. TensorRT supports plugins, which can be integrated into the graph pass. However, this was not a priority, since the runtime TensorRT integration can always fall back to existing MxNet operators. Supporting plugins is possible, but will be added in future commits.
> >> >> >>>>
> >> >> >>>> 6. The upcoming PR will support fp16 and fp32, but not int8. Since int8 support in MxNet is itself very new, figuring out calibration and other details is left for a future commit.
> >> >> >>>>
> >> >> >>>> 7. TensorRT 4 is going to have a new feature called BYOM (bring your own memory). This means that instead of telling TensorRT how much memory it can use, the data/scratch space tensors can be provided by MxNet, and can be re-used by MxNet when not running the forward pass. The memory in permanent use will then be limited to TensorRT storing weights. Support for this feature will be added in a future commit.
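For context on the plan caching mentioned in item 2 above, caching amounts to
persisting the serialized engine and reloading it instead of rebuilding it at
start-up. A rough standalone sketch using the TensorRT Python API (shown with
the newer tensorrt package interface; the exact API differs across TensorRT
versions, and none of this is part of the upcoming PR):

    import os
    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    PLAN_PATH = "model.plan"  # example path; plans are specific to the GPU and TensorRT version

    def load_or_build_engine(build_engine_fn):
        # Reuse an ahead-of-time serialized plan if one exists, otherwise pay the
        # one-time engine build cost and cache the serialized result for later runs.
        if os.path.exists(PLAN_PATH):
            with open(PLAN_PATH, "rb") as f:
                return trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())
        engine = build_engine_fn()  # expensive: kernel selection and compilation
        with open(PLAN_PATH, "wb") as f:
            f.write(engine.serialize())
        return engine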
> >> >> >>>>
> >> >> >>>
> >> >> >>
> >> >>
> >> >
> >> >
> >>
>
