arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wes McKinney <wesmck...@gmail.com>
Subject Re: [DISCUSS][C++][Proposal] Threading engine for Arrow
Date Mon, 06 May 2019 14:22:39 GMT
Anton, per your comment:

> Sounds like a good way to go! We'll create a demo, as you suggested, implementing a parallel
execution model for a simple analytics pipeline that reads and processes the files. My only
concern is about adding more pipeline breaker nodes and compute intensive operations into
this demo because min/max are effectively no-ops fused into I/O scan node. What do you think
about adding group-by into this picture, effectively implementing NY taxi and/or mortgage
benchmarks? Ideally, I'd like to go even further and add sci-kit learn like stuff for processing
that data in order to demonstrate the co-existence side of the story. What do you think? So,
the idea of the prototype will be to validate the parallel execution model as the first step.
After that, it'll help to shape API for both - execution nodes and the threading backend.
Does it sound right to you?

The question is whether you want to spend at least a month or more of
intense development on something else (a basic query engine, as we've
been discussing in [1]) before we are able to develop consensus about
the approach to threading. Personally, I would not make this choice
given that there are good options available to move along the
discussion of the parallelism API. I think Antoine's point about the
CSV reader as an example of a non-trivial processing pipeline that
also includes IO operations.

- Wes

[1]: https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit?usp=sharing

On Fri, May 3, 2019 at 12:41 PM Jed Brown <jed@jedbrown.org> wrote:
>
> "Malakhov, Anton" <anton.malakhov@intel.com> writes:
>
> >> > the library creates threads internally.  It's a disaster for managing
> >> > oversubscription and affinity issues among groups of threads and/or
> >> > multiple processes (e.g., MPI).
> >
> > This is exactly what I'm talking about referring as issues with threading composability!
OpenMP is not easy to have inside a library. I described it in this document: https://cwiki.apache.org/confluence/display/ARROW/Parallel+Execution+Engine
>
> Thanks for this document.  I'm no great fan of OpenMP, but it's being
> billed by most vendors (especially Intel) as the way to go in the
> scientific computing space and has become relatively popular (much more
> so than TBB).
>
> You linked to a NumPy discussion
> (https://github.com/numpy/numpy/issues/11826) that is encountering the
> same issues, but proposing solutions based on the global environment.
> That is perhaps acceptable for typical Python callers due to the GIL,
> but C++ callers may be using threads themselves.  A typical example:
>
> App:
>   calls libB sequentially:
>     calls Arrow sequentially (wants to use threads)
>   calls libC sequentially:
>     omp parallel (creates threads somehow):
>       calls Arrow from threads (Arrow should not create more)
>   omp parallel:
>     calls libD from threads:
>       calls Arrow (Arrow should not create more)
>
> Arrow doesn't need to know the difference between the libC and libD
> cases, but it may make a difference to the implementation of those
> libraries.  In both of these cases, the user may desire that Arrow
> create tasks for load balancing reasons (but no new threads) so long as
> they can run on the specified thread team.
>
> I have yet to see a complete solution to this problem, but we should
> work out which modes are worth supporting and how that interface would
> look.
>
>
> Global solutions like this one (linked by Antoine)
>
>   https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/thread-pool.cc#L268
>
> imply that threading mode is global and set via an environment variable,
> neither of which are true in cases such as the above (and many simpler
> cases).

Mime
View raw message