arrow-dev mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}
Date Sun, 26 Feb 2017 18:12:13 GMT
hi Henry,

Thank you for these comments.

I think having a kind of "Apache Commons for [Modern] C++" would be an
ideal (though perhaps initially more labor intensive) solution.
There's code in Arrow that I would move into this project if it
existed. I am happy to help make this happen if there is interest from
the Kudu and Impala communities. I am not sure logistically what would
be the most expedient way to establish the project, whether as an ASF
Incubator project or possibly as a new TLP that could be created by
spinning IP out of Apache Kudu.

I'm interested to hear the opinions of others, and possible next steps.

Thanks
Wes

On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <henry@apache.org> wrote:
> Thanks for bringing this up, Wes.
>
> On 25 February 2017 at 14:18, Wes McKinney <wesmckinn@gmail.com> wrote:
>
>> Dear Apache Kudu and Apache Impala (incubating) communities,
>>
>> (I'm not sure of the best way to have a cross-list discussion, so I
>> apologize if this does not work well)
>>
>> On the recent Apache Parquet sync call, we discussed C++ code sharing
>> between the codebases in Apache Arrow and Apache Parquet, and
>> opportunities for more code sharing with Kudu and Impala as well.
>>
>> As context:
>>
>> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>> first C++ release within Apache Parquet. I got involved with this
>> project a little over a year ago and was faced with the unpleasant
>> decision to copy and paste a significant amount of code out of
>> Impala's codebase to bootstrap the project.
>>
>> * In parallel, we began the Apache Arrow project, which is designed to
>> be a complementary library for file formats (like Parquet), storage
>> engines (like Kudu), and compute engines (like Impala and pandas).
>>
>> * As Arrow and parquet-cpp matured, an increasing amount of code
>> overlap crept up surrounding buffer memory management and IO
>> interface. We recently decided in PARQUET-818
>> (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
>> to remove some of the obvious code overlap in Parquet and make
>> libarrow.a/so a hard compile and link-time dependency for
>> libparquet.a/so.
>>
>> * There is still quite a bit of code in parquet-cpp that would better
>> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>> compression, bit utilities, and so forth. Much of this code originated
>> from Impala.
>>
>> This brings me to a next set of points:
>>
>> * parquet-cpp contains quite a bit of code that was extracted from
>> Impala. This is mostly self-contained in
>> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>>
>> * My understanding is that Kudu extracted certain computational
>> utilities from Impala in its early days, but these tools have likely
>> diverged as the needs of the projects have evolved.
>>
>> Since all of these projects are quite different in their end goals
>> (runtime systems vs. libraries), touching code that is tightly coupled
>> to either Kudu or Impala's runtimes is probably not worth discussing.
>> However, I think there is a strong basis for collaboration on
>> computational utilities and vectorized array processing. Some obvious
>> areas that come to mind:
>>
>> * SIMD utilities (for hashing or processing of preallocated contiguous
>> memory)
>> * Array encoding utilities: RLE / Dictionary, etc.
>> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>> contributed a patch to parquet-cpp around this)
>> * Date and time utilities
>> * Compression utilities
>>
>
> Between Kudu and Impala (at least) there are many more opportunities for
> sharing. Threads, logging, metrics, concurrent primitives - the list is
> quite long.
>
>
>>
>> I hope the benefits are obvious: consolidating efforts on unit
>> testing, benchmarking, performance optimizations, continuous
>> integration, and platform compatibility.
>>
>> Logistically speaking, one possible avenue might be to use Apache
>> Arrow as the place to assemble this code. Its thirdparty toolchain is
>> small, and it builds and installs fast. It is intended as a library to
>> have its headers used and linked against other applications. (As an
>> aside, I'm very interested in building optional support for Arrow
>> columnar messages into the kudu client).
>>
>
> In principle I'm in favour of code sharing, and it seems very much in
> keeping with the Apache way. However, practically speaking I'm of the
> opinion that it only makes sense to house shared support code in a
> separate, dedicated project.
>
> Embedding the shared libraries in, e.g., Arrow naturally limits the scope
> of sharing to utilities that Arrow is interested in. It would make no sense
> to add a threading library to Arrow if it was never used natively. Muddying
> the waters of the project's charter seems likely to lead to user, and
> developer, confusion. Similarly, we should not necessarily couple Arrow's
> design goals to those it inherits from Kudu and Impala's source code.
>
> I think I'd rather see a new Apache project than re-use a current one for
> two independent purposes.
>
>
>>
>> The downside of code sharing, which may have prevented it so far, is
>> the logistics of coordinating ASF release cycles and keeping build
>> toolchains in sync. It's taken us the past year to stabilize the
>> design of Arrow for its intended use cases, so at this point if we
>> went down this road I would be OK with helping the community commit to
>> a regular release cadence that would be faster than Impala, Kudu, and
>> Parquet's respective release cadences. Since members of the Kudu and
>> Impala PMC are also on the Arrow PMC, I trust we would be able to
>> collaborate to each other's mutual benefit and success.
>>
>> Note that Arrow does not throw C++ exceptions and similarly follows
>> the Google C++ style guide to the same extent as Kudu and Impala.
>>
>> If this is something that either the Kudu or Impala communities would
>> like to pursue in earnest, I would be happy to work with you on next
>> steps. I would suggest that we start with something small so that we
>> could address the necessary build toolchain changes, and develop a
>> workflow for moving around code and tests, a protocol for code reviews
>> (e.g. Gerrit), and coordinating ASF releases.
>>
>
> I think, if I'm reading this correctly, that you're assuming integration
> with the 'downstream' projects (e.g. Impala and Kudu) would be done via
> their toolchains. For something as fast moving as utility code - and
> critical, where you want the latency between adding a fix and including it
> in your build to be ~0 - that's a non-starter to me, at least with how the
> toolchains are currently realised.
>
> I'd rather have the source code directly imported into Impala's tree -
> whether by git submodule or other mechanism. That way the coupling is
> looser, and we can move more quickly. I think that's important to other
> projects as well.
>
> Henry
>
>
>
>>
>> Let me know what you think.
>>
>> best
>> Wes
>>
