From: Wes McKinney
Date: Sun, 26 Feb 2017 14:24:34 -0500
Subject: Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}
To:
dev@kudu.apache.org
Cc: dev@arrow.apache.org, dev@impala.incubator.apache.org, dev@parquet.apache.org
Content-Type: text/plain; charset=UTF-8
archived-at: Sun, 26 Feb 2017 19:25:22 -0000

hi Miki,

No, I don't think so. APR is a portable C library. The code we are
talking about would be intended for use in C++11/14 projects like
Impala and Kudu (and Arrow and Parquet).

Wes

On Sun, Feb 26, 2017 at 1:58 PM, Miki Tebeka wrote:
> Can't some (most) of it be added to APR?
>
> On Sun, Feb 26, 2017 at 8:12 PM, Wes McKinney wrote:
>
>> hi Henry,
>>
>> Thank you for these comments.
>>
>> I think having a kind of "Apache Commons for [Modern] C++" would be an
>> ideal (though perhaps initially more labor intensive) solution.
>> There's code in Arrow that I would move into this project if it
>> existed. I am happy to help make this happen if there is interest from
>> the Kudu and Impala communities. I am not sure logistically what would
>> be the most expedient way to establish the project, whether as an ASF
>> Incubator project or possibly as a new TLP that could be created by
>> spinning IP out of Apache Kudu.
>>
>> I'm interested to hear the opinions of others, and possible next steps.
>>
>> Thanks
>> Wes
>>
>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson wrote:
>> > Thanks for bringing this up, Wes.
>> >
>> > On 25 February 2017 at 14:18, Wes McKinney wrote:
>> >
>> >> Dear Apache Kudu and Apache Impala (incubating) communities,
>> >>
>> >> (I'm not sure the best way to have a cross-list discussion, so I
>> >> apologize if this does not work well)
>> >>
>> >> On the recent Apache Parquet sync call, we discussed C++ code sharing
>> >> between the codebases in Apache Arrow and Apache Parquet, and
>> >> opportunities for more code sharing with Kudu and Impala as well.
>> >>
>> >> As context
>> >>
>> >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>> >> first C++ release within Apache Parquet.
>> >> I got involved with this project a little over a year ago and was
>> >> faced with the unpleasant decision to copy and paste a significant
>> >> amount of code out of Impala's codebase to bootstrap the project.
>> >>
>> >> * In parallel, we began the Apache Arrow project, which is designed to
>> >> be a complementary library for file formats (like Parquet), storage
>> >> engines (like Kudu), and compute engines (like Impala and pandas).
>> >>
>> >> * As Arrow and parquet-cpp matured, an increasing amount of code
>> >> overlap crept up surrounding buffer memory management and IO
>> >> interface. We recently decided in PARQUET-818
>> >> (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
>> >> to remove some of the obvious code overlap in Parquet and make
>> >> libarrow.a/so a hard compile and link-time dependency for
>> >> libparquet.a/so.
>> >>
>> >> * There is still quite a bit of code in parquet-cpp that would better
>> >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>> >> compression, bit utilities, and so forth. Much of this code originated
>> >> from Impala.
>> >>
>> >> This brings me to a next set of points:
>> >>
>> >> * parquet-cpp contains quite a bit of code that was extracted from
>> >> Impala. This is mostly self-contained in
>> >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>> >>
>> >> * My understanding is that Kudu extracted certain computational
>> >> utilities from Impala in its early days, but these tools have likely
>> >> diverged as the needs of the projects have evolved.
>> >>
>> >> Since all of these projects are quite different in their end goals
>> >> (runtime systems vs. libraries), touching code that is tightly coupled
>> >> to either Kudu or Impala's runtimes is probably not worth discussing.
>> >> However, I think there is a strong basis for collaboration on
>> >> computational utilities and vectorized array processing.
>> >> Some obvious areas that come to mind:
>> >>
>> >> * SIMD utilities (for hashing or processing of preallocated contiguous
>> >> memory)
>> >> * Array encoding utilities: RLE / Dictionary, etc.
>> >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>> >> contributed a patch to parquet-cpp around this)
>> >> * Date and time utilities
>> >> * Compression utilities
>> >>
>> >
>> > Between Kudu and Impala (at least) there are many more opportunities for
>> > sharing. Threads, logging, metrics, concurrent primitives - the list is
>> > quite long.
>> >
>> >
>> >>
>> >> I hope the benefits are obvious: consolidating efforts on unit
>> >> testing, benchmarking, performance optimizations, continuous
>> >> integration, and platform compatibility.
>> >>
>> >> Logistically speaking, one possible avenue might be to use Apache
>> >> Arrow as the place to assemble this code. Its third-party toolchain is
>> >> small, and it builds and installs fast. It is intended as a library to
>> >> have its headers used and linked against other applications. (As an
>> >> aside, I'm very interested in building optional support for Arrow
>> >> columnar messages into the Kudu client).
>> >>
>> >
>> > In principle I'm in favour of code sharing, and it seems very much in
>> > keeping with the Apache way. However, practically speaking I'm of the
>> > opinion that it only makes sense to house shared support code in a
>> > separate, dedicated project.
>> >
>> > Embedding the shared libraries in, e.g., Arrow naturally limits the scope
>> > of sharing to utilities that Arrow is interested in. It would make no sense
>> > to add a threading library to Arrow if it was never used natively. Muddying
>> > the waters of the project's charter seems likely to lead to user, and
>> > developer, confusion. Similarly, we should not necessarily couple Arrow's
>> > design goals to those it inherits from Kudu and Impala's source code.
>> >
>> > I think I'd rather see a new Apache project than re-use a current one for
>> > two independent purposes.
>> >
>> >
>> >>
>> >> The downside of code sharing, which may have prevented it so far, is
>> >> the logistics of coordinating ASF release cycles and keeping build
>> >> toolchains in sync. It's taken us the past year to stabilize the
>> >> design of Arrow for its intended use cases, so at this point if we
>> >> went down this road I would be OK with helping the community commit to
>> >> a regular release cadence that would be faster than Impala, Kudu, and
>> >> Parquet's respective release cadences. Since members of the Kudu and
>> >> Impala PMC are also on the Arrow PMC, I trust we would be able to
>> >> collaborate to each other's mutual benefit and success.
>> >>
>> >> Note that Arrow does not throw C++ exceptions and similarly follows
>> >> the Google C++ style guide to the same extent as Kudu and Impala.
>> >>
>> >> If this is something that either the Kudu or Impala communities would
>> >> like to pursue in earnest, I would be happy to work with you on next
>> >> steps. I would suggest that we start with something small so that we
>> >> could address the necessary build toolchain changes, and develop a
>> >> workflow for moving around code and tests, a protocol for code reviews
>> >> (e.g. Gerrit), and coordinating ASF releases.
>> >>
>> >
>> > I think, if I'm reading this correctly, that you're assuming integration
>> > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
>> > their toolchains. For something as fast moving as utility code - and
>> > critical, where you want the latency between adding a fix and including it
>> > in your build to be ~0 - that's a non-starter to me, at least with how the
>> > toolchains are currently realised.
>> >
>> > I'd rather have the source code directly imported into Impala's tree -
>> > whether by git submodule or other mechanism.
>> > That way the coupling is looser, and we can move more quickly. I think
>> > that's important to other projects as well.
>> >
>> > Henry
>> >
>> >
>> >
>> >>
>> >> Let me know what you think.
>> >>
>> >> best
>> >> Wes
>> >>