arrow-dev mailing list archives

From Leif Walsh <leif.wa...@gmail.com>
Subject Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}
Date Mon, 27 Feb 2017 03:18:16 GMT
I also support the idea of creating an "apache commons modern c++" style
library, maybe tailored toward the needs of columnar data processing
tools.  I think APR is the wrong project but I think that *style* of
project is the right direction to aim.

I agree this adds test and release process complexity across products but I
think the benefits of a shared, well-tested library outweigh that, and
creating such test infrastructure will have long-term benefits as well.

I'd be happy to lend a hand wherever it's needed.

On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <todd@cloudera.com> wrote:

> Hey folks,
>
> As Henry mentioned, Impala is starting to share more code with Kudu (most
> notably our RPC system, but that pulls in a fair bit of utility code as
> well), so we've been chatting periodically offline about the best way to do
> this. Having more projects potentially interested in collaborating is
> definitely welcome, though I think it does also increase the complexity of
> whatever solution we come up with.
>
> I think the potential benefits of collaboration are fairly self-evident, so
> I'll focus on my concerns here, which somewhat echo Henry's.
>
> 1) Open source release model
>
> The ASF is very much against having projects which do not do releases. So,
> if we were to create some new ASF project to hold this code, we'd be
> expected to do frequent releases thereof. Wes volunteered above to lead
> frequent releases, but we actually need at least 3 PMC members to vote on
> each release, and given people can come and go, we'd probably need at least
> 5-8 people who are actively committed to helping with the release process
> of this "commons" project.
>
> Unlike our existing projects, which seem to release every 2-3 months, if
> that, I think this one would have to release _much_ more frequently, if we
> expect downstream projects to depend on released versions rather than just
> pulling in some recent (or even trunk) git hash. Since the ASF requires the
> normal voting period and process for every release, I don't think we could
> do something like have "daily automatic releases", etc.
>
> We could probably campaign the ASF membership to treat this project
> differently, either as (a) a repository of code that never releases, in
> which case the "downstream" projects are responsible for vetting IP, etc,
> as part of their own release processes, or (b) a project which does
> automatic releases voted upon by robots. I'm guessing that (a) is more
> palatable from an IP perspective, and also from the perspective of the
> downstream projects.
>
>
> 2) Governance/review model
>
> The more projects there are sharing this common code, the more difficult it
> is to know whether a change would break something, or even whether a change
> is considered desirable for all of the projects. I don't want to get into
> some world where any change to a central library requires a multi-week
> proposal/design-doc/review across 3+ different groups of committers, all of
> whom may have different near-term priorities. On the other hand, it would
> be pretty frustrating if the week before we're trying to cut a Kudu release
> branch, someone in another community decides to make a potentially
> destabilizing change to the RPC library.
>
>
> 3) Pre-commit/test mechanics
>
> Semi-related to the above: we currently feel pretty confident when we make
> a change to a central library like kudu/util/thread.cc that nothing broke
> because we run the full suite of Kudu tests. Of course the central
> libraries have some unit test coverage, but I wouldn't be confident with
> any sort of model where shared code can change without verification by a
> larger suite of tests.
>
> On the other hand, I also don't want to move to a model where any change to
> shared code requires a 6+-hour precommit spanning several projects, each of
> which may have its own set of potentially-flaky pre-commit tests, etc. I
> can imagine that if an Arrow developer made some change to "thread.cc" and
> saw that TabletServerStressTest failed their precommit, they'd have no idea
> how to triage it, etc. That could be a strong disincentive to continued
> innovation in these areas of common code, which we'll need a good way to
> avoid.
>
> I think some of the above could be ameliorated with really good
> infrastructure -- eg on a test failure, automatically re-run the failed
> test on both pre-patch and post-patch, do a t-test to check statistical
> significance in flakiness level, etc. But, that's a lot of infrastructure
> that doesn't currently exist.
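
As a rough illustration of the re-run-and-compare idea in the paragraph
above, here is a minimal sketch of Welch's t-test over pass/fail outcomes
from repeated pre-patch and post-patch runs. The names (WelchResult,
WelchTTest) are hypothetical and do not come from Kudu, Impala, or Arrow.
Treating 0/1 outcomes as samples for a t-test is only an approximation, and
turning the statistic into a p-value would still need a Student's t CDF from
a statistics library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Result of Welch's unequal-variance t-test (hypothetical helper type).
struct WelchResult {
  double t;   // positive when the post-patch failure rate is higher
  double df;  // Welch-Satterthwaite degrees of freedom
};

// Sample mean of 0/1 outcomes (1 = the test run failed).
double Mean(const std::vector<int>& xs) {
  double sum = 0.0;
  for (int x : xs) sum += x;
  return sum / static_cast<double>(xs.size());
}

// Unbiased sample variance; requires at least two observations.
double Variance(const std::vector<int>& xs, double mean) {
  double ss = 0.0;
  for (int x : xs) ss += (x - mean) * (x - mean);
  return ss / static_cast<double>(xs.size() - 1);
}

// Compares failure rates of repeated runs before and after a patch.
// Assumption: each side has >= 2 runs and at least one side shows some
// variation; otherwise the statistic is undefined.
WelchResult WelchTTest(const std::vector<int>& pre,
                       const std::vector<int>& post) {
  assert(pre.size() >= 2 && post.size() >= 2);
  const double n1 = static_cast<double>(pre.size());
  const double n2 = static_cast<double>(post.size());
  const double m1 = Mean(pre);
  const double m2 = Mean(post);
  const double a = Variance(pre, m1) / n1;   // squared std. error, pre
  const double b = Variance(post, m2) / n2;  // squared std. error, post
  assert(a + b > 0.0);
  const double t = (m2 - m1) / std::sqrt(a + b);
  const double df =
      (a + b) * (a + b) / (a * a / (n1 - 1.0) + b * b / (n2 - 1.0));
  return {t, df};
}
```

For example, 1 failure in 10 pre-patch runs against 7 failures in 10
post-patch runs gives a t of roughly 3.3 with about 15.5 degrees of freedom,
which would suggest the patch made the test flakier rather than the failure
being pre-existing noise.
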
>
>
> 4) Integration mechanics for breaking changes
>
> Currently these common libraries are treated as components of monolithic
> projects. That means it's no extra overhead for us to make some kind of
> change which breaks an API in src/kudu/util/ and at the same time updates
> all call sites. The internal libraries have no semblance of API
> compatibility guarantees, etc, and adding one is not without cost.
>
> Before sharing code, we should figure out how exactly we'll manage the
> cases where we want to make some change in a common library that breaks an
> API used by other projects, given there's no way to make an atomic commit
> across many repositories. One option is that each "user" of the libraries
> manually "rolls" to new versions when they feel like it, but there's still
> now a case where a common change "pushes work onto" the consumers to update
> call sites, etc.
>
> Admittedly, the number of breaking API changes in these common libraries is
> relatively small, but it would still be good to understand how we would plan
> to manage them.
>
> -Todd
>
> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <wesmckinn@gmail.com>
> wrote:
>
> > hi Henry,
> >
> > Thank you for these comments.
> >
> > I think having a kind of "Apache Commons for [Modern] C++" would be an
> > ideal (though perhaps initially more labor intensive) solution.
> > There's code in Arrow that I would move into this project if it
> > existed. I am happy to help make this happen if there is interest from
> > the Kudu and Impala communities. I am not sure logistically what would
> > be the most expedient way to establish the project, whether as an ASF
> > Incubator project or possibly as a new TLP that could be created by
> > spinning IP out of Apache Kudu.
> >
> > I'm interested to hear the opinions of others, and possible next steps.
> >
> > Thanks
> > Wes
> >
> > On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <henry@apache.org> wrote:
> > > Thanks for bringing this up, Wes.
> > >
> > > On 25 February 2017 at 14:18, Wes McKinney <wesmckinn@gmail.com> wrote:
> > >
> > >> Dear Apache Kudu and Apache Impala (incubating) communities,
> > >>
> > >> (I'm not sure the best way to have a cross-list discussion, so I
> > >> apologize if this does not work well)
> > >>
> > >> On the recent Apache Parquet sync call, we discussed C++ code sharing
> > >> between the codebases in Apache Arrow and Apache Parquet, and
> > >> opportunities for more code sharing with Kudu and Impala as well.
> > >>
> > >> As context
> > >>
> > >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
> > >> first C++ release within Apache Parquet. I got involved with this
> > >> project a little over a year ago and was faced with the unpleasant
> > >> decision to copy and paste a significant amount of code out of
> > >> Impala's codebase to bootstrap the project.
> > >>
> > >> * In parallel, we began the Apache Arrow project, which is designed to
> > >> be a complementary library for file formats (like Parquet), storage
> > >> engines (like Kudu), and compute engines (like Impala and pandas).
> > >>
> > >> * As Arrow and parquet-cpp matured, an increasing amount of code
> > >> overlap crept up surrounding buffer memory management and IO
> > >> interface. We recently decided in PARQUET-818
> > >> (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
> > >> to remove some of the obvious code overlap in Parquet and make
> > >> libarrow.a/so a hard compile and link-time dependency for
> > >> libparquet.a/so.
> > >>
> > >> * There is still quite a bit of code in parquet-cpp that would better
> > >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
> > >> compression, bit utilities, and so forth. Much of this code originated
> > >> from Impala.
> > >>
> > >> This brings me to a next set of points:
> > >>
> > >> * parquet-cpp contains quite a bit of code that was extracted from
> > >> Impala. This is mostly self-contained in
> > >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> > >>
> > >> * My understanding is that Kudu extracted certain computational
> > >> utilities from Impala in its early days, but these tools have likely
> > >> diverged as the needs of the projects have evolved.
> > >>
> > >> Since all of these projects are quite different in their end goals
> > >> (runtime systems vs. libraries), touching code that is tightly coupled
> > >> to either Kudu or Impala's runtimes is probably not worth discussing.
> > >> However, I think there is a strong basis for collaboration on
> > >> computational utilities and vectorized array processing. Some obvious
> > >> areas that come to mind:
> > >>
> > >> * SIMD utilities (for hashing or processing of preallocated contiguous
> > >> memory)
> > >> * Array encoding utilities: RLE / Dictionary, etc.
> > >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
> > >> contributed a patch to parquet-cpp around this)
> > >> * Date and time utilities
> > >> * Compression utilities
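
To make the "array encoding utilities" item above concrete, here is a
minimal sketch of plain run-length encoding over an int32 array. It is an
illustration only: the names (Run, RleEncode, RleDecode) are hypothetical,
and this is not the hybrid RLE/bit-packing scheme used by Parquet nor code
taken from Impala, Kudu, or Arrow.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// One run: a value and how many times it repeats consecutively.
using Run = std::pair<int32_t, uint32_t>;

// Collapses consecutive equal values into (value, count) runs.
std::vector<Run> RleEncode(const std::vector<int32_t>& values) {
  std::vector<Run> runs;
  for (int32_t v : values) {
    if (!runs.empty() && runs.back().first == v) {
      ++runs.back().second;  // extend the current run
    } else {
      runs.emplace_back(v, 1u);  // start a new run
    }
  }
  return runs;
}

// Expands runs back into the original sequence.
std::vector<int32_t> RleDecode(const std::vector<Run>& runs) {
  std::vector<int32_t> values;
  for (const Run& r : runs) {
    values.insert(values.end(), r.second, r.first);
  }
  return values;
}
```

Small, dependency-free helpers of this shape are the kind of code that is
easy to unit test and benchmark in one place and then reuse across projects.
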
> > >>
> > >
> > > Between Kudu and Impala (at least) there are many more opportunities for
> > > sharing. Threads, logging, metrics, concurrent primitives - the list is
> > > quite long.
> > >
> > >
> > >>
> > >> I hope the benefits are obvious: consolidating efforts on unit
> > >> testing, benchmarking, performance optimizations, continuous
> > >> integration, and platform compatibility.
> > >>
> > >> Logistically speaking, one possible avenue might be to use Apache
> > >> Arrow as the place to assemble this code. Its thirdparty toolchain is
> > >> small, and it builds and installs fast. It is intended as a library to
> > >> have its headers used and linked against other applications. (As an
> > >> aside, I'm very interested in building optional support for Arrow
> > >> columnar messages into the kudu client).
> > >>
> > >
> > > In principle I'm in favour of code sharing, and it seems very much in
> > > keeping with the Apache way. However, practically speaking I'm of the
> > > opinion that it only makes sense to house shared support code in a
> > > separate, dedicated project.
> > >
> > > Embedding the shared libraries in, e.g., Arrow naturally limits the scope
> > > of sharing to utilities that Arrow is interested in. It would make no sense
> > > to add a threading library to Arrow if it was never used natively. Muddying
> > > the waters of the project's charter seems likely to lead to user, and
> > > developer, confusion. Similarly, we should not necessarily couple Arrow's
> > > design goals to those it inherits from Kudu and Impala's source code.
> > >
> > > I think I'd rather see a new Apache project than re-use a current one for
> > > two independent purposes.
> > >
> > >
> > >>
> > >> The downside of code sharing, which may have prevented it so far, is
> > >> the logistics of coordinating ASF release cycles and keeping build
> > >> toolchains in sync. It's taken us the past year to stabilize the
> > >> design of Arrow for its intended use cases, so at this point if we
> > >> went down this road I would be OK with helping the community commit to
> > >> a regular release cadence that would be faster than Impala, Kudu, and
> > >> Parquet's respective release cadences. Since members of the Kudu and
> > >> Impala PMC are also on the Arrow PMC, I trust we would be able to
> > >> collaborate to each other's mutual benefit and success.
> > >>
> > >> Note that Arrow does not throw C++ exceptions and similarly follows the
> > >> Google C++ style guide to the same extent as Kudu and Impala.
> > >>
> > >> If this is something that either the Kudu or Impala communities would
> > >> like to pursue in earnest, I would be happy to work with you on next
> > >> steps. I would suggest that we start with something small so that we
> > >> could address the necessary build toolchain changes, and develop a
> > >> workflow for moving around code and tests, a protocol for code reviews
> > >> (e.g. Gerrit), and coordinating ASF releases.
> > >>
> > >
> > > I think, if I'm reading this correctly, that you're assuming integration
> > > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
> > > their toolchains. For something as fast moving as utility code - and
> > > critical, where you want the latency between adding a fix and including it
> > > in your build to be ~0 - that's a non-starter to me, at least with how the
> > > toolchains are currently realised.
> > >
> > > I'd rather have the source code directly imported into Impala's tree -
> > > whether by git submodule or other mechanism. That way the coupling is
> > > looser, and we can move more quickly. I think that's important to other
> > > projects as well.
> > >
> > > Henry
> > >
> > >
> > >
> > >>
> > >> Let me know what you think.
> > >>
> > >> best
> > >> Wes
> > >>
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
-- 
Cheers,
Leif
