arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Malakhov, Anton" <anton.malak...@intel.com>
Subject [DISCUSS][C++][Proposal] Threading engine for Arrow
Date Wed, 01 May 2019 02:28:28 GMT
Hi dear Arrow developers, Antoine,

I'd like to kick off the discussion of the threading engine that Arrow can use underneath
for implementing multicore parallelism for execution nodes, kernels, and/or all the functions,
which can be optimized this way.
I've documented some ideas on Arrow's Confluence Wiki: https://cwiki.apache.org/confluence/display/ARROW/Parallel+Execution+Engine
The bottom line is that while Arrow is moving into the right direction introducing shared
thread pool, there are some questions and concerns about current implementation and the way
how it is supposed to co-exist with other threaded libraries ("threading composability") while
providing efficient nestable NUMA&cache-aware data and data-flow parallelism.
I suggest to introduce threading layers like in other libraries like MKL and Numba, starting
with TBB-based layer. Or maybe even use TBB directly. In short, there are the following arguments
for it:

1.      Designed for composability from day zero. Avoids mandatory parallelism. Provides work
stealing and FIFO scheduling. Compatible with parallel depth first scheduling (a better composability
research).

2.      TBB Flow Graph. It fits nicely into data flow and execution nodes model of SQL databases.
Besides basic nodes needed for implementing an execution engine, it also provides a foundation
for heterogeneous and distributed computing (async_node, opencl_node, distributed_node)

3.      Arrow's ThreadPool, TaskGroup, and ParallelFor have direct equivalent in TBB: task_arena,
task_group, and parallel_for while providing mature and performant implementation, which solves
many if not all of the XXX todo notes in the comments like exceptions, singletons and time
of initialization, lock-free.

4.      Concurrent hash tables, queues, vector and other concurrent containers. Hash tables
are required for implementing parallel versions of joins, groupby, uniq, dictionary operations.
There is a contribution to integrate libcuckoo under TBB interface.

5.      TBB scalable malloc and memory pools, which can use any user-provided memory chunk
for scalable allocation. Arrow uses jemalloc, which is slower in some cases than tbbmalloc
or tcmalloc.

6.      OpenMP is good for NUMA with static schedule, however, there is no good answer for
dynamic tasks, graphs. TBB provides tools for implementing NUMA support: task_arena, task_scheduler_observer,
task affinity & priorities, committed to improve NUMA for its other customers in 2019.

7.      TBB is licensed under Apache 2.0, has conda-forge feedstock, supports CMake, it's
adopted for CPU scheduling by other industry players, has multiple ports for other OSes and
CPU arches.

Full disclosure: I was TBB developer before its 1.0 version, responsible for multiple core
components like hash tables, adaptive partitioning, interfaces of memory pools and task_arena,
all of these are very relevant to Arrow. I've background in scalability and NUMA-aware performance
optimization like what we did for OpenCL runtime for CPU (TBB-based). I also was behind optimizations
for Intel Distribution for Python and its threading composability story<https://software.intel.com/en-us/blogs/2016/04/04/unleash-parallel-performance-of-python-programs>.
Thus, I'm sincerely hope to reuse all these stuff in order to deliver the best performance
for Arrow.


Best regards,
Anton Malakhov<http://www.linkedin.com/in/antonmalakhov>
IAGS Scripting Analyzers & Tools

O: +1-512-3620-512
1300 S. MoPac Expy
Office:  AN4-C1-D4
Austin, TX 78746
Intel Corporation | www.intel.com


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message