arrow-dev mailing list archives

From Antoine Pitrou <anto...@python.org>
Subject Re: [DISCUSS][C++] Rethinking our current C++ shared library (.so / .dll) approach
Date Tue, 17 Sep 2019 08:15:35 GMT

I agree with Uwe that becoming more monolithic than we already are may
become a big PR problem at some point.

Regards

Antoine.


On 17/09/2019 at 09:41, Uwe L. Korn wrote:
> Hello,
> 
> I'm actually against this proposal.
> 
> My main concern at the moment is that Arrow C++/Python is growing into a
> really heavy tool where you always have to bring along all the baggage even
> when you're only using a small part of it. This makes it harder to use Arrow
> in projects because:
> 
> * Sheer size: the more dependencies the full build has, the larger the
>   installable grows.
> * Having a large number of dependencies also means that you need to take
>   care of security scanning for all of them in production settings. Even
>   when you're not using those parts, you need to check for version updates,
>   correct licenses and the origin of the dependencies. Having a more modular
>   build is much simpler than mastering the art of convincing corporate IT.
> * Dependencies declared by third-party libraries become less transparent.
>   When a library depends just on a large libarrow.so and fails at startup
>   with a missing symbol error, a user is confused and might think that the
>   Arrow installation is corrupt, whereas if the error reports that
>   libarrow_flight.so is missing, they are much more aware that their local
>   build is one without Flight being built.
> 
> I would actually like to see the pyarrow packages split up into several
> packages in the future; making the C++ part a single shared object would
> considerably hinder this. I don't have the resources to move forward with
> this now, but as I know that I will need it, I'm going to want to implement
> it at some point.
> 
> Uwe
> 
> On Tue, Sep 17, 2019, at 6:22 AM, Micah Kornfield wrote:
>> I don't have a strong opinion here, but had a question and comment:
>>
>> Are there any implications from a project governance perspective of
>> packaging Parquet and Arrow into a single shared library?
>>
>> As a comment, I'm a big +1 on trying to tease apart the circular
>> dependencies between Parquet/Arrow (and any other modules).  As noted
>> above, I think this boils down to isolating IO and Buffer data structures
>> into 1 library and having the Arrow Array data structures in their own
>> separate libraries.
>>
>> Thanks,
>> Micah
>>
>> On Mon, Sep 16, 2019 at 7:35 PM Sutou Kouhei <kou@clear-code.com> wrote:
>>
>>> Hi,
>>>
>>> If this is circular, it's a problem. But this isn't circular
>>> for now.
>>>
>>> I think that we can use libarrow as the fundamental shared
>>> library to provide common implementations like [1] if we need
>>> to provide common implementations for templates. (I think that
>>> we don't currently provide common implementations for templates.)
>>>
>>> [1]
>>> https://github.com/apache/arrow/pull/5221/commits/e88b2579f04451d741eeddcb6697914bcc1019a6
>>>
>>> Anyway, I'm not strongly opposed to this idea. If we choose the
>>> single shared library approach, Linux packages, GLib bindings
>>> and Ruby bindings can follow the change.
>>>
>>>
>>> Thanks,
>>> --
>>> kou
>>>
>>> In <CAJPUwMDWENCjPBw+HrSWAOJFez7e_yci-Fg2D3LwgVNCf45iWQ@mail.gmail.com>
>>>   "Re: [DISCUSS][C++] Rethinking our current C++ shared library (.so /
>>> .dll) approach" on Thu, 12 Sep 2019 13:23:01 -0500,
>>>   Wes McKinney <wesmckinn@gmail.com> wrote:
>>>
>>>> One thing I forgot to mention:
>>>>
>>>> One of the things driving the creation of new shared libraries is
>>>> interdependencies. For example:
>>>>
>>>> libarrow -> libparquet
>>>> libarrow -> libarrow_dataset
>>>> libparquet -> libarrow_dataset
>>>>
>>>> With the modular LLVM-like approach this issue goes away.
>>>>
>>>> On Thu, Sep 12, 2019 at 1:16 PM Wes McKinney <wesmckinn@gmail.com> wrote:
>>>>>
>>>>> I forgot to add the link to the LLVM library listing
>>>>>
>>>>> https://gist.github.com/wesm/d13c2844db0c19477e8ee5c95e36a0dc
>>>>>
>>>>> On Thu, Sep 12, 2019 at 1:14 PM Wes McKinney <wesmckinn@gmail.com> wrote:
>>>>>>
>>>>>> hi folks,
>>>>>>
>>>>>> I wanted to share some concerns that I have about our current
>>>>>> trajectory with regards to producing shared libraries from the Arrow
>>>>>> build system.
>>>>>>
>>>>>> Currently, a comprehensive build produces many shared libraries:
>>>>>>
>>>>>> * libarrow
>>>>>> * libarrow_dataset
>>>>>> * libarrow_flight
>>>>>> * libarrow_python
>>>>>> * libgandiva
>>>>>> * libparquet
>>>>>> * libplasma
>>>>>>
>>>>>> There are some others. There are a number of problems with the
>>>>>> current approach:
>>>>>>
>>>>>> * Each DLL needs its own set of "visibility" macros to control the
>>>>>> use of __declspec(dllimport/dllexport) on Windows, which is necessary
>>>>>> to instruct the import or export of symbols between DLLs on Windows.
>>>>>> See e.g.
>>>>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/visibility.h
>>>>>>
>>>>>> * Templates instantiated in one DLL may cause a violation of the
>>>>>> One Definition Rule during linking (we lost at least a day of work
>>>>>> time collectively to issues around this in ARROW-6244). It is good to
>>>>>> be able to share common template interfaces in general
>>>>>>
>>>>>> * Statically-linked dependencies in one shared lib may need to be
>>>>>> statically linked into another library. For example, libgandiva
>>>>>> statically links parts of LLVM, but we will likely have some other
>>>>>> code that makes use of LLVM for other purposes (it has been discussed
>>>>>> in the context of Avro parsing)
>>>>>>
>>>>>> Overall, my preferred solution to these issues is to move to a similar
>>>>>> approach to what the LLVM project does. To help understand, let me
>>>>>> have you first look at the libraries that come from the llvm-7-dev
>>>>>> package on Ubuntu
>>>>>>
>>>>>> Here we have a collection of static "module" libraries that implement
>>>>>> different parts of the LLVM platform. Finally, a _single_ shared
>>>>>> library libLLVM-7.so is produced.
>>>>>>
>>>>>> I think we should do the same thing in Apache Arrow, so that we only
>>>>>> ever produce a single shared library from the build. We can
>>>>>> additionally make the "name" of this shared library configurable to
>>>>>> suit different needs. For example, the default name could be simply
>>>>>> "libarrow.so" or something. But if someone wants to produce a
>>>>>> barebones Parquet shared library they can override the name to create
>>>>>> a "libparquet.so" that contains only the "libarrow_core.a" and
>>>>>> "libarrow_io.a" symbols needed for reading Parquet files.
>>>>>>
>>>>>> This would have additional benefits:
>>>>>>
>>>>>> * Use the same visibility macros for all exported C++ symbols, rather
>>>>>> than having to define DLL-specific visibility
>>>>>>
>>>>>> * Improved modularization of builds and linking for third party users,
>>>>>> similar to the way that LLVM's modular linking works, see the way
>>>>>> that Gandiva requests specific components from LLVM to use for static
>>>>>> linking
>>>>>> https://github.com/apache/arrow/blob/master/cpp/cmake_modules/FindLLVM.cmake#L53
>>>>>>
>>>>>> * Net simpler linking and deployment. Only one shared library to
>>>>>> deal with
>>>>>>
>>>>>> There are some drawbacks, however:
>>>>>>
>>>>>> * Our C++ Linux packaging approach would need to be changed to be
>>>>>> more LLVM-like (a single .deb/.yum package containing the C++ platform
>>>>>> rather than many packages as now)
>>>>>>
>>>>>> Interested to hear from other C++ developers.
>>>>>>
>>>>>> Thanks
>>>>>> Wes
>>>
>>
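
For readers unfamiliar with the per-DLL "visibility" macros that Wes mentions in his first bullet, here is a minimal sketch of the general pattern. It is not copied from Arrow's visibility.h: the names DEMO_EXPORT, DEMO_EXPORTING, DEMO_STATIC and demo::AddOne are hypothetical and chosen only for illustration. Under the current multi-library layout, each shared library would need its own copy of a block like this with its own macro prefix.

```cpp
#include <cassert>

// Assumption: the build system normally defines DEMO_EXPORTING only while
// compiling the library itself. It is defined here so that this sketch
// compiles as a single file.
#define DEMO_EXPORTING

#if defined(_WIN32) || defined(__CYGWIN__)
#  if defined(DEMO_STATIC)
     // Static builds need no import/export annotation.
#    define DEMO_EXPORT
#  elif defined(DEMO_EXPORTING)
     // Building the DLL: export the symbol.
#    define DEMO_EXPORT __declspec(dllexport)
#  else
     // Consuming the DLL: import the symbol.
#    define DEMO_EXPORT __declspec(dllimport)
#  endif
#elif defined(__GNUC__)
   // GCC/Clang: keep the symbol visible even under -fvisibility=hidden.
#  define DEMO_EXPORT __attribute__((visibility("default")))
#else
#  define DEMO_EXPORT
#endif

namespace demo {

// A hypothetical exported function of the hypothetical "demo" library.
DEMO_EXPORT int AddOne(int x) { return x + 1; }

// Small sanity check that the annotated function behaves as expected.
inline void SelfCheck() { assert(AddOne(41) == 42); }

}  // namespace demo
```

With a single shared library, one such macro could annotate every exported symbol, which is the first of the "additional benefits" listed in the original proposal.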
