From: Antoine Pitrou
To: dev@arrow.apache.org
Subject: Re: [DISCUSS][C++] Rethinking our current C++ shared library (.so / .dll) approach
Date: Tue, 17 Sep 2019 10:15:35 +0200
Message-ID: <392f6731-0e74-e51e-e9c3-7dbc42a30ed7@python.org>
References: <20190917.113446.1393729608527228924.kou@clear-code.com> <1189bca3-0d1e-492b-92b8-4915af821ee0@www.fastmail.com>
In-Reply-To:
<1189bca3-0d1e-492b-92b8-4915af821ee0@www.fastmail.com>

I agree with Uwe that becoming more monolithic than we already are may
become a big PR problem at some point.

Regards

Antoine.


On 17/09/2019 at 09:41, Uwe L. Korn wrote:
> Hello,
>
> I'm actually against this proposal.
>
> My main concern at the moment is that Arrow C++/Python is growing into a really heavy tool where you always have to bring along all the baggage even when you're only using a small part of it. This makes it harder to use Arrow in projects because:
>
> * The sheer size: the more dependencies the full build has, the larger the installable becomes.
> * Having a large number of dependencies also means that you need to take care of security scanning for all of them in production settings. Even when you're not using those parts, you will need to check for version updates, correct licenses and the origin of the dependencies. Having a more modular build is much simpler than mastering the art of convincing corporate IT.
> * Defining dependencies from third-party libraries gets less transparent. When a library depends on just one large libarrow.so and fails to start with a missing symbol error, a user is confused and might think that the Arrow installation is corrupt, whereas if the error reports that libarrow_flight.so is missing, they are much more aware that their local build is one without Flight.
>
> I would actually like to see the pyarrow packages split up into several packages in the future; making the C++ part a single shared object would hinder this considerably. I don't have the resources to move forward with this now, but as I know that I will need it, I'm going to want to implement it at some point.
>
> Uwe
>
> On Tue, Sep 17, 2019, at 6:22 AM, Micah Kornfield wrote:
>> I don't have a strong opinion here, but had a question and comment:
>>
>> Are there any implications from a project governance perspective of
>> packaging Parquet and Arrow into a single shared library?
>>
>> As a comment, I'm a big +1 on trying to tease apart the circular
>> dependencies between Parquet/Arrow (and any other modules). As noted
>> above, I think this boils down to isolating IO and Buffer data structures
>> into one library and having the Arrow Array data structures in their own
>> separate libraries.
>>
>> Thanks,
>> Micah
>>
>> On Mon, Sep 16, 2019 at 7:35 PM Sutou Kouhei wrote:
>>
>>> Hi,
>>>
>>> If this is circular, it's a problem. But this isn't circular
>>> for now.
>>>
>>> I think that we can use libarrow as the fundamental shared
>>> library to provide common implementations like [1] if we need
>>> to provide a common implementation for templates. (I think that
>>> we don't provide a common implementation for templates.)
>>>
>>> [1]
>>> https://github.com/apache/arrow/pull/5221/commits/e88b2579f04451d741eeddcb6697914bcc1019a6
>>>
>>> Anyway, I'm not strongly opposed to this idea. If we choose the
>>> one-shared-library approach, Linux packages, GLib bindings
>>> and Ruby bindings can follow the change.
>>>
>>>
>>> Thanks,
>>> --
>>> kou
>>>
>>> In
>>> "Re: [DISCUSS][C++] Rethinking our current C++ shared library (.so /
>>> .dll) approach" on Thu, 12 Sep 2019 13:23:01 -0500,
>>> Wes McKinney wrote:
>>>
>>>> One thing I forgot to mention:
>>>>
>>>> One of the things driving the creation of new shared libraries is
>>>> interdependencies. For example:
>>>>
>>>> libarrow -> libparquet
>>>> libarrow -> libarrow_dataset
>>>> libparquet -> libarrow_dataset
>>>>
>>>> With the modular LLVM-like approach this issue goes away.
>>>>
>>>> On Thu, Sep 12, 2019 at 1:16 PM Wes McKinney wrote:
>>>>>
>>>>> I forgot to add the link to the LLVM library listing
>>>>>
>>>>> https://gist.github.com/wesm/d13c2844db0c19477e8ee5c95e36a0dc
>>>>>
>>>>> On Thu, Sep 12, 2019 at 1:14 PM Wes McKinney wrote:
>>>>>>
>>>>>> hi folks,
>>>>>>
>>>>>> I wanted to share some concerns that I have about our current
>>>>>> trajectory with regards to producing shared libraries from the Arrow
>>>>>> build system.
>>>>>>
>>>>>> Currently, a comprehensive build produces many shared libraries:
>>>>>>
>>>>>> * libarrow
>>>>>> * libarrow_dataset
>>>>>> * libarrow_flight
>>>>>> * libarrow_python
>>>>>> * libgandiva
>>>>>> * libparquet
>>>>>> * libplasma
>>>>>>
>>>>>> There are some others. There are a number of problems with the
>>>>>> current approach:
>>>>>>
>>>>>> * Each DLL needs its own set of "visibility" macros to control the use
>>>>>> of __declspec(dllimport/dllexport) on Windows, which is necessary to
>>>>>> instruct the import or export of symbols between DLLs on Windows. See
>>>>>> e.g. https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/visibility.h
>>>>>>
>>>>>> * Templates instantiated in one DLL may cause a violation of the One
>>>>>> Definition Rule during linking (we lost at least a day of work time
>>>>>> collectively to issues around this in ARROW-6244). It is good to be
>>>>>> able to share common template interfaces in general
>>>>>>
>>>>>> * Statically-linked dependencies in one shared lib may need to be
>>>>>> statically linked into another library. For example, libgandiva
>>>>>> statically links parts of LLVM, but we will likely have some other
>>>>>> code that makes use of LLVM for other purposes (it has been discussed
>>>>>> in the context of Avro parsing)
>>>>>>
>>>>>> Overall, my preferred solution to these issues is to move to a similar
>>>>>> approach to what the LLVM project does.
>>>>>> To help understand, let me
>>>>>> have you first look at the libraries that come from the llvm-7-dev
>>>>>> package on Ubuntu
>>>>>>
>>>>>> Here we have a collection of static "module" libraries that implement
>>>>>> different parts of the LLVM platform. Finally, a _single_ shared
>>>>>> library libLLVM-7.so is produced.
>>>>>>
>>>>>> I think we should do the same thing in Apache Arrow. So we only ever
>>>>>> will produce a single shared library from the build. We can
>>>>>> additionally make the "name" of this shared library configurable to
>>>>>> suit different needs. For example, the default name could be simply
>>>>>> "libarrow.so" or something. But if someone wants to produce a
>>>>>> barebones Parquet shared library they can override the name to create
>>>>>> a "libparquet.so" that contains only the "libarrow_core.a" and
>>>>>> "libarrow_io.a" symbols needed for reading Parquet files.
>>>>>>
>>>>>> This would have additional benefits:
>>>>>>
>>>>>> * Use the same visibility macros for all exported C++ symbols, rather
>>>>>> than having to define DLL-specific visibility
>>>>>>
>>>>>> * Improved modularization of builds and linking for third party users,
>>>>>> similar to the way that LLVM's modular linking works, see the way that
>>>>>> Gandiva requests specific components from LLVM to use for static
>>>>>> linking https://github.com/apache/arrow/blob/master/cpp/cmake_modules/FindLLVM.cmake#L53
>>>>>>
>>>>>> * Net simpler linking and deployment. Only one shared library to deal
>>>>>> with
>>>>>>
>>>>>> There are some drawbacks, however:
>>>>>>
>>>>>> * Our C++ Linux packaging approach would need to be changed to be more
>>>>>> LLVM-like (a single .deb/.yum package containing the C++ platform
>>>>>> rather than many packages as now)
>>>>>>
>>>>>> Interested to hear from other C++ developers.
>>>>>>
>>>>>> Thanks
>>>>>> Wes
>>>
>>
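
For readers unfamiliar with the "visibility" macros mentioned in the thread, here is a minimal sketch of what a single shared macro could look like when every module ends up in one shared library. The names MYLIB_EXPORT, MYLIB_EXPORTING, MYLIB_STATIC and the two functions are hypothetical, chosen only for illustration; this is not Arrow's actual visibility.h.

```cpp
// Sketch only: the MYLIB_* names are hypothetical, not Arrow's real macros.
// A build system would normally pass -DMYLIB_EXPORTING while compiling the
// library itself; it is defined here so the example is self-contained.
#define MYLIB_EXPORTING

#if defined(_WIN32) || defined(__CYGWIN__)
#  if defined(MYLIB_STATIC)
#    define MYLIB_EXPORT  // static build: no import/export decoration
#  elif defined(MYLIB_EXPORTING)
#    define MYLIB_EXPORT __declspec(dllexport)  // building the DLL
#  else
#    define MYLIB_EXPORT __declspec(dllimport)  // consuming the DLL
#  endif
#else
#  define MYLIB_EXPORT __attribute__((visibility("default")))
#endif

namespace mylib {
namespace core {
// Symbol exported from a hypothetical "core" module.
MYLIB_EXPORT int BufferAlignment() { return 64; }
}  // namespace core

namespace parquet {
// Symbol from a different hypothetical module, decorated with the same macro.
MYLIB_EXPORT int DefaultAlignment() { return core::BufferAlignment(); }
}  // namespace parquet
}  // namespace mylib
```

With several DLLs, each one would need its own copy of this block with a distinct macro name and its own "exporting" define, because a symbol exported by one DLL has to be imported by the others. With one shared library, a single macro is enough for both modules.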