tvm-dev mailing list archives

From Jared Roesch
Subject Re: [dmlc/tvm] [RFC] Relay Dynamic Runtime (#2810)
Date Mon, 25 Mar 2019 00:45:35 GMT
The implementation of the hardware is not of interest to the high-level Relay program: all
Tensor-to-Tensor functions are black boxes. They can be implemented any way you want: in C++,
in TVM, or as a hardware accelerator primitive. If you want to map a subset of the program down
to this hardware you will have to unroll it, which most fixed-function hardware requires.
You can then package the unrolled program as a new operation and rewrite the program to use
it instead.
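
As a rough illustration of the unroll-and-replace step described above, here is a minimal Python
sketch. It is not Relay or TVM code: `unroll`, `run`, and `fuse` are hypothetical helpers, plain
Python callables stand in for Tensor-to-Tensor functions, and the loop is assumed to have a
statically known trip count.

```python
def unroll(body_ops, trip_count):
    """Flatten a loop with a static trip count into a straight-line op list."""
    return [op for _ in range(trip_count) for op in body_ops]


def run(ops, x):
    """Apply each op in order, feeding the output of one into the next."""
    for op in ops:
        x = op(x)
    return x


def fuse(ops):
    """Wrap a straight-line op list as one opaque function (hypothetical
    stand-in for an accelerator primitive)."""
    def fused(x):
        return run(ops, x)
    return fused


# Loop body: x -> 2 * x + 1, repeated three times.
body = [lambda x: 2 * x, lambda x: x + 1]
straight_line = unroll(body, 3)

# The high-level program now calls a single black-box operation.
accel_op = fuse(straight_line)
result = accel_op(1)  # same value as run(straight_line, 1)
```

The caller sees only `accel_op`; how it is implemented underneath is irrelevant to the
high-level program.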

Hardware that does not support arbitrary and dynamic program sizes cannot execute all models
of interest; such models fundamentally don't fit into Halide/TVM-style DSLs. The deep learning
community has focused on optimizing for a small subset of models with very regular behavior, but
the next wave of models invalidates many assumptions, such as the statically known dimensions or
static control flow required by polyhedral optimizers. The point of this VM is to coordinate at
the higher level, where you need iteration, dynamic allocation, and communication.

I have thought further about a register-based VM and see no strong argument for why registers
are better than stacks. Most of the research on dynamic VMs focuses on this distinction in order
to reduce memory movement and dispatch overhead while executing the application. Packed functions
*will* dominate execution time, so optimizing for dispatch is an incredibly premature optimization.

The other argument for register-based VMs is instruction-level parallelism. Again, instructions
don't matter much here: the meaningful parallelism is in the data dependencies between operators
and inside the operators themselves (e.g., parallel matrix multiplication).

The point of the parallel monad approach is not to add it to the source language, but that
the execution technique is a valid way for us to get parallelism between operator invocations.
We can view the future graph as the data dependency graph and do graph reduction over it.

For example, if I depend on a sequence of function calls, it is valid to evaluate them in parallel
while evaluating a future computation that may depend on their results. The amount of
synchronization needed here is very small, and again the real opportunity for parallelism is inside
the operators, in the tensor-to-tensor functions. We don't need to worry about where the results are
stored; we essentially give a result a register name when we push a future into stack position `n`.

In a sense we already have an infinite register file because we can address any stack position.
In this case we can easily address a future result by referencing position `n`. The only
difference is the location where operations look for their results. We need a call stack for
functions, and function calls are the primary operation, based on observations of current workloads.
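
To make the idea of futures in stack positions concrete, here is a minimal Python sketch. It is
not the Relay VM implementation: `StackVM` and its methods are hypothetical names, ordinary Python
callables stand in for packed functions, and a thread pool stands in for whatever scheduler the
runtime would actually use.

```python
import operator
from concurrent.futures import Future, ThreadPoolExecutor


class StackVM:
    """Toy stack machine in which every stack slot holds a Future."""

    def __init__(self, max_workers=4):
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.stack = []  # slot n plays the role of register n

    def push_const(self, value):
        """Push an already-resolved value and return its stack position."""
        fut = Future()
        fut.set_result(value)
        self.stack.append(fut)
        return len(self.stack) - 1

    def invoke(self, op, *arg_slots):
        """Schedule op on the futures at arg_slots and push the result future.

        Arguments can only refer to earlier slots and the pool starts tasks
        in submission order, so a task never waits on one that has not started.
        """
        args = [self.stack[i] for i in arg_slots]
        fut = self.pool.submit(lambda: op(*(a.result() for a in args)))
        self.stack.append(fut)
        return len(self.stack) - 1

    def read(self, slot):
        """Block until the value at the given stack position is available."""
        return self.stack[slot].result()


vm = StackVM()
a = vm.push_const(2)
b = vm.push_const(3)
s = vm.invoke(operator.add, a, b)  # independent of p, may run concurrently
p = vm.invoke(operator.mul, a, b)
t = vm.invoke(operator.add, s, p)  # waits only on slots s and p
total = vm.read(t)
```

The only synchronization is the `result()` call on each argument future; the stack position
returned by `invoke` is the name later instructions use to find the value.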

Furthermore, the current approach makes the VMCompiler far simpler and easier to extend.

I personally value simplicity. We have *zero* evidence that the current approach is slow;
in fact, we have evidence to the contrary. The initial prototype is already faster than MXNet's
executor, which is used in production at AWS.
