tvm-dev mailing list archives

From Lianmin Zheng <>
Subject [dmlc/tvm] [RFC][AUTOTVM] Auto-Scheduler from Compute Declaration (#2954)
Date Tue, 02 Apr 2019 18:10:51 GMT
# Auto-Scheduler
TVM decouples kernel implementation into compute and schedule. The compute part is a friendly
DSL that can describe algorithms intuitively. However, the schedule part still requires strong
expert knowledge and time-consuming tuning to provide decent performance. The tuning process
is partially automated by the existing autotvm package, but a human-engineered template is
still required.

This RFC proposes a "real" autotvm, which we can call the auto-scheduler. It aims to remove
all human effort from the schedule part.

# Proposed Design 
The auto-scheduler is built on the existing autotvm package. It generates a template from the
compute declaration. This template can then be either

* Statically filled by heuristic rules and cost functions to provide reasonable performance, or
* Dynamically tuned by autotvm to provide better performance within some time budget
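
As a toy illustration of these two modes, here is a sketch in plain Python (it does not use TVM). The knob names `tile` and `unroll`, the heuristic in `fill_statically`, and the cost model in `fake_measure` are all assumptions made up for this example; they are not the rules or knobs used in the branch.

```python
import itertools

# Hypothetical template: a knob space for one loop of extent 128.
LOOP_EXTENT = 128
SPACE = {
    "tile": [d for d in range(1, LOOP_EXTENT + 1) if LOOP_EXTENT % d == 0],
    "unroll": [False, True],
}


def fill_statically(space, vec_size):
    """Static mode: pick knob values by a fixed rule, without measuring."""
    # Assumed heuristic: the largest tile that fits in one vector register.
    tile = max(t for t in space["tile"] if t <= vec_size)
    return {"tile": tile, "unroll": True}


def tune(space, measure, budget):
    """Tuning mode: try up to `budget` configurations, keep the cheapest."""
    names = sorted(space)
    candidates = itertools.product(*(space[n] for n in names))
    best_cfg, best_cost = None, float("inf")
    for values in itertools.islice(candidates, budget):
        cfg = dict(zip(names, values))
        cost = measure(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


def fake_measure(cfg):
    """Stand-in for a hardware measurement; invented for this sketch."""
    cost = abs(cfg["tile"] - 16) + 1.0
    return cost * (0.5 if cfg["unroll"] else 1.0)
```

The static path costs nothing at compile time but depends on how good the rule is, while the tuned path spends its budget on measurements and can only match or beat any configuration it has tried.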

The auto-scheduler takes a computation graph described in the TVM DSL as input, then classifies
the read/write patterns and the type of computation. It dispatches the declarations
to different "meta templates", which generate autotvm templates from the declarations.
There are four types of meta templates: simple reduction, complex reduction, direct compute,
and location-tunable compute. The auto-scheduler performs parallelization, vectorization, tiling,
and operator fusion.
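
The dispatch step can be pictured as a lookup from a classified declaration to a generator function. The sketch below is plain Python with hypothetical names (`META_TEMPLATES`, `make_generator`, `dispatch`) and a hypothetical dict-based declaration format. It assumes the classification has already been done by an earlier analysis pass, because the criteria for each of the four kinds are not spelled out here.

```python
# The four meta-template kinds named in this RFC.
KINDS = (
    "simple reduction",
    "complex reduction",
    "direct compute",
    "location-tunable compute",
)


def make_generator(kind):
    """Build a placeholder generator for one kind of meta template."""
    def generate(decl):
        # A real generator would emit an autotvm template. This placeholder
        # only records which meta template handled the declaration.
        return {"op": decl["name"], "meta_template": kind}
    return generate


# Hypothetical registry: kind -> generator function.
META_TEMPLATES = {kind: make_generator(kind) for kind in KINDS}


def dispatch(decl):
    """Send a classified declaration to its meta template."""
    kind = decl["kind"]  # assumed to be filled in by an analysis pass
    try:
        generator = META_TEMPLATES[kind]
    except KeyError:
        raise ValueError("no meta template for kind %r" % kind)
    return generator(decl)
```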

The code is available on [my branch](
The current implementation is in pure Python because autotvm is mainly written in Python. Moving
the whole autotvm package to C++ is part of the long-term plan. The code is organized as follows.
* Analysis on access pattern [python/tvm/autotvm/auto_schedule/](
* CPU backend [python/tvm/autotvm/auto_schedule/backend/](
* GPU backend [python/tvm/autotvm/auto_schedule/backend/](
* Configuration for the auto-scheduler [python/tvm/autotvm/auto_schedule/](
* Experimental auto-packing for optimizing vectorization and locality [python/tvm/autotvm/auto_schedule/](
* Test case [tests/python/unittest/](

## API
There are only two user-facing API calls:

* `autotvm.AutoSchedulerOptions(**kwargs)`
This configures the auto-scheduler. The arguments include hardware configuration (vector
lanes, number of threads, size of shared memory, etc.) and tuning configuration (how many
tuning knobs to generate).
* `autotvm.create_schedule(tensors)`
This is similar to `tvm.create_schedule`, but returns an already optimized schedule.

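import tvm
from tvm import autotvm

A = tvm.placeholder((128,), name='A')
B = tvm.placeholder((128,), name='B')
C = tvm.compute((128,), lambda i: A[i] + B[i] * 2)

# Options set here apply to schedules created inside the `with` block.
with autotvm.AutoSchedulerOptions(vec_size=8, num_threads=16):
    s, bufs = autotvm.create_schedule([A, B, C])

# Use s and bufs as with a hand-written schedule.
func = tvm.build(s, bufs)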
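
For readers unfamiliar with the `with`-scoped options pattern, the following is a minimal sketch of how such an object could work. It is plain Python and not the code in the branch: `SchedulerOptions`, `create_schedule_stub` and the default values are hypothetical stand-ins. Only the option names `vec_size` and `num_threads` come from the example above.

```python
class SchedulerOptions:
    """Hypothetical stand-in for a scoped options object."""

    _stack = []  # options objects currently active, innermost last
    _defaults = {"vec_size": 4, "num_threads": 1}  # assumed defaults

    def __init__(self, **kwargs):
        unknown = set(kwargs) - set(self._defaults)
        if unknown:
            raise TypeError("unknown option(s): %s" % sorted(unknown))
        self.values = dict(self._defaults, **kwargs)

    def __enter__(self):
        SchedulerOptions._stack.append(self)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        SchedulerOptions._stack.pop()
        return False  # do not swallow exceptions

    @classmethod
    def current(cls):
        """Return the innermost active options, or the defaults."""
        return cls._stack[-1] if cls._stack else cls()


def create_schedule_stub(tensors):
    """Placeholder for schedule creation; it only reads the options."""
    opts = SchedulerOptions.current().values
    return {
        "tensors": list(tensors),
        "vectorize": opts["vec_size"],
        "parallel": opts["num_threads"],
    }
```

With this design the options do not have to be threaded through every call: anything invoked inside the `with` block sees them, and the previous settings are restored on exit.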

# Examples
1. [Tutorial](
   This is a tutorial on how to statically use the auto-scheduler or auto-tune it.
2. [Schedule a whole network](
   This example is adapted from #2498. It is a LeNet-like convolutional neural network written
purely in TVM (without graph IR). The auto-scheduler also provides basic operator fusion for
it. Right now we can only run the forward pass. I am working on fixing the backward pass.

# Performance
One reachable performance goal is to replace more than 90% of the schedule code in existing TOPI
with this auto-scheduler. I haven't done the experiments, but I believe the generated templates
cover the existing search space for most operators (including conv2d, reduction, ...).

Another part of the goal is to provide reasonable static performance. In the "Schedule a whole
network" example, for the batched forward pass, the current performance is 1.2x slower than out-of-the-box
TF + Keras, and 10x faster than a naive schedule (fuse and parallelize the outer loop) on an Intel
i7-8750H. For static usage, the inputs to the auto-scheduler are the parameters of the heuristic rules
and the hardware configuration. We will gather all inputs into a global config, so users can
still do some quick "tuning".

# Todo list
 - [ ] Performance test and improvement to cover more than 90% of the schedule code in TOPI
       Improve the heuristic rules to provide better static performance, and test to make sure
we cover the search space of existing templates.
 - [ ] Improve tuning speed
       The current implementation does analysis and generates the template on the fly, which
is expensive and redundant during batched tuning. We should decouple the template generation
and template tuning, and explicitly cache the template.
 - [ ] (long-term) Move all autotvm-related code to C++
 - [ ] Improve loop partition to better handle partial tiles and vectorization.
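
The tuning-speed item above amounts to generating a template once per workload and reusing it for every configuration that is measured. A minimal sketch of that idea in plain Python follows. The workload key, the placeholder template and the `measure` callback are assumptions for illustration, and `functools.lru_cache` is just one simple way to make the cache explicit.

```python
import functools

# Counts how often the expensive generation step actually runs.
GENERATION_COUNT = {"calls": 0}


@functools.lru_cache(maxsize=None)
def generate_template(workload):
    """Analyse a workload and build its template (expensive, so cached).

    `workload` must be hashable, e.g. a tuple of (op name, shape).
    """
    GENERATION_COUNT["calls"] += 1
    op, shape = workload
    return (op, shape, ("tile", "unroll"))  # placeholder template


def tune_batch(workload, configs, measure):
    """Measure every config against one template and return the best."""
    results = []
    for cfg in configs:
        template = generate_template(workload)  # cache hit after 1st call
        results.append((measure(template, cfg), cfg))
    return min(results, key=lambda r: r[0])
```

Keying the cache on the workload means that a batch of measurements for the same operator and shape pays for analysis and generation only once.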
