mxnet-dev mailing list archives

From Tianqi Chen <notificati...@github.com>
Subject Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)
Date Mon, 23 Dec 2019 01:35:03 GMT
After some thought along this direction, I found a better and more fun answer to the above question:
support tuple/ellipsis/slice in the TVM FFI efficiently.

I quickly hacked up a POC in https://github.com/tqchen/tvm/tree/pyffi that supports the following
benchmark script (disclaimer: it is only a POC, so it is not intended for use and not fully optimized,
but it demonstrates all the technical flows necessary to make a fully functioning FFI).

```python
import timeit

# Time a no-op PackedFunc call whose argument mixes None, Ellipsis, and slice.
setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop((None, ..., slice(0, 100, 2)))')
timer.timeit(1)  # warm up
num_repeat = 1000
print("tvm.tuple_slice_ellipsis_combo:", timer.timeit(num_repeat) / num_repeat)

# Baseline: a cheap numpy call, for comparison.
setup = """
import numpy as np
"""
timer = timeit.Timer(setup=setup,
                     stmt='np.empty((1, 2, 1))')
timer.timeit(1)
print("numpy.empty:", timer.timeit(num_repeat) / num_repeat)

# A simple string argument through the same FFI path.
setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop("mystr")')
timer.timeit(1)
print("tvm.str_arg:", timer.timeit(num_repeat) / num_repeat)
```

On my laptop (13-inch MacBook), the results are as follows:
```
$ TVM_FFI=cython python benchmark_ffi.py
tvm.tuple_slice_ellipsis_combo: 4.615739999999924e-07
numpy.empty: 2.7016599999998834e-07
tvm.str_arg: 2.3390799999997714e-07
```

## What is Implemented in the POC

In the POC, we introduce specific objects for Ellipsis and Slice (Tuple is already supported
via the ADT). During a PackedFunc call, a Python tuple/ellipsis/slice is converted into the
corresponding object supported by the backend. We implemented a Cython version of the conversion
(the previous recursive conversion was in Python) to back it up.
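The recursive conversion can be sketched as follows. This is only an illustration of the idea, not the actual TVM code: the class names `EllipsisObj`, `SliceObj`, and `TupleObj` are hypothetical stand-ins for the backend objects.

```python
class EllipsisObj:
    """Hypothetical backend-side stand-in for Python's Ellipsis."""

class SliceObj:
    """Hypothetical backend-side stand-in for a Python slice."""
    def __init__(self, start, stop, step):
        self.start, self.stop, self.step = start, stop, step

class TupleObj:
    """Hypothetical backend-side tuple (ADT-like container)."""
    def __init__(self, fields):
        self.fields = fields

def convert_arg(value):
    """Deep-convert a Python argument into backend objects, recursing into tuples."""
    if value is Ellipsis:
        return EllipsisObj()
    if isinstance(value, slice):
        return SliceObj(value.start, value.stop, value.step)
    if isinstance(value, tuple):
        return TupleObj([convert_arg(v) for v in value])
    return value  # ints, strings, None, etc. pass through unchanged

# The argument shape used in the benchmark above:
converted = convert_arg((None, ..., slice(0, 100, 2)))
```

In the POC this recursion runs in Cython rather than Python, which is what makes the per-call cost competitive with the numbers above.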

The reason we are able to create objects on the Cython side is that all TVM objects have
recently been made POD-C compatible, so an object can be created on the Cython side without
crossing the DLL boundary and then passed to the C++ backend.

We can see from the benchmark that the cost of such a deep copy is at a reasonable level. We
also used only the default memory allocator, so there is likely room for further improvement.

## Discussions

Please also see the tradeoff discussion in the last post. As we can see, the main difference
here is where the conversion happens, and whether we copy lazily or deeply:

- In the pybind case: conversion happens on the C++ side, and data structures are created
lazily.
- In the POC case: conversion happens in Cython, and data structures are deeply translated
into another in-memory format.

The laziness certainly avoids a copy in cases where we do not need to book-keep the created
argument. On the other hand, supporting a common data structure on the C++ side means the
binding can potentially be reused by other language frontends.
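The two strategies can be contrasted with a minimal sketch. The classes below are purely illustrative (neither pybind11 nor TVM code): one snapshots the caller's tuple eagerly, the other keeps a reference and defers any work until an element is accessed.

```python
class EagerTuple:
    """Deep copy at call time: the backend owns an independent snapshot."""
    def __init__(self, py_tuple):
        # Copied once; safe to book-keep even after the caller mutates or
        # drops its own data.
        self.fields = list(py_tuple)

class LazyTuple:
    """Lazy view: hold a reference and fetch elements only on access."""
    def __init__(self, py_tuple):
        # No copy; cheap when the argument is consumed once and never stored.
        self._src = py_tuple
    def __getitem__(self, i):
        return self._src[i]

args = (1, 2, 3)
eager = EagerTuple(args)  # pays the copy up front
lazy = LazyTuple(args)    # pays per access, keeps the Python object alive
```

The eager form is what the POC does, which is why its converted objects can be handed to any language frontend; the lazy form is closer in spirit to what a pybind-style binding achieves.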


--
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-568325041