impala-dev mailing list archives

From Henry Robinson <he...@cloudera.com>
Subject Re: Getting rid of thirdparty
Date Thu, 10 Mar 2016 19:03:41 GMT
"the upstream ASF repo should use ASF-released artifacts here"

While there's precedent elsewhere in the ASF for depending on downstream
vendor-specific artifacts, I feel pretty strongly that there should be a
clean separation between the ASF and downstream dependencies.

I take your point about the flexibility of choosing which toolchain
dependencies to take. Might be a good follow-on step to allow that
(TOOLCHAIN_MODE={ALL, COMPILE, TEST}) or something, but we can wait to see
if this is needed by the community.
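
To make the TOOLCHAIN_MODE idea concrete, here is a minimal sketch of how such a switch might pick dependency groups. Nothing like this exists yet: the function name, the grouping, and the way HDFS appears in both lists are assumptions for illustration only; the component names are the ones mentioned in this thread.

```python
import os

# Hypothetical grouping of dependencies; the split is an assumption.
# HDFS is listed twice because libhdfs is needed at compile time and
# the HDFS binaries are needed by the test environment.
COMPILE_DEPS = ["avro-c", "open-ldap", "hdfs"]
TEST_DEPS = ["hbase", "hdfs", "hive"]


def select_dependencies(mode):
    """Return the dependency names to fetch for a given TOOLCHAIN_MODE."""
    mode = mode.upper()
    if mode == "COMPILE":
        selected = COMPILE_DEPS
    elif mode == "TEST":
        selected = TEST_DEPS
    elif mode == "ALL":
        selected = COMPILE_DEPS + TEST_DEPS
    else:
        raise ValueError("TOOLCHAIN_MODE must be ALL, COMPILE or TEST")
    # Drop duplicates (e.g. hdfs) while keeping the original order.
    return list(dict.fromkeys(selected))


if __name__ == "__main__":
    # Default to ALL when the variable is unset.
    print(select_dependencies(os.environ.get("TOOLCHAIN_MODE", "ALL")))
```

Someone who only wants to compile the backend would set TOOLCHAIN_MODE=COMPILE and skip the test binaries entirely.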

On 10 March 2016 at 10:59, Matthew Jacobs <mj@cloudera.com> wrote:

> Thanks for outlining these options. How does native-toolchain factor into
> our ASF story? I.e. do we need it to be less Cloudera-project-oriented, or
> is it OK for it to contain CDH (rather than Apache Hadoop) deployments? If
> we're considering it to be more Cloudera-focused, it seems like it could
> make upstream contributions difficult as there wouldn't really be a non-CDH
> build/runtime toolchain. I guess upstream contributors could fork our
> toolchain (or start their own) and replace the CDH components? If we detach
> the compile-time dependencies and the test runtime projects, it would
> probably make things easier for the rest of the world as they could easily
> take the native-toolchain, the test environment, or both.
>
> On Thu, Mar 10, 2016 at 10:38 AM Henry Robinson <henry@apache.org> wrote:
>
> > One of the tasks remaining before we can push Impala's code to the ASF's
> > git instance is to reduce the size of the repository. Right now even a
> > checkout of origin/cdh5-trunk is in the multi-GB range.
> >
> > The vast majority of that is in the thirdparty/ directory, which adds up
> > over the git history to be pretty huge with all the various versions
> > we've checked in. Removing it shrinks cdh5-trunk to ~200MB. So I propose we
> > get rid of thirdparty/ altogether.
> >
> > There are two main dependency types in thirdparty/. The first is a
> > compile-time C++ dependency like open-ldap or avro-c. These are (almost)
> > all superseded by the toolchain (see
> > https://github.com/cloudera/native-toolchain) build. A couple of
> > exceptions are Squeasel and Mustache, which don't produce their own
> > libraries but are source files directly included in the Impala build. I
> > don't see a good reason we couldn't move those to the toolchain as well.
> >
> > The other kind of dependency is the test binaries that are used when we
> > start Impala's test environment (i.e. start the Hive metastore, HBase,
> > etc.). These are trickier to extract (they're not just JARs, but
> > bin/hadoop etc.). We also need to be able to change these dependencies
> > pretty efficiently - the upstream ASF repo should use ASF-released
> > artifacts here, but downstream vendors (like Cloudera) will want to
> > replace the ASF artifacts with their own releases.
> >
> > Note that the Java binaries in thirdparty/ are *not* the compile-time
> > dependencies for Impala's Java frontend - those are resolved via Maven.
> > It's a bad thing that there are two dependency resolution mechanisms, but
> > we might not be able to solve that issue right now.
> >
> > So what should we do with the test dependencies? I see the following
> > options:
> >
> > 1. Put them in the native-toolchain repository. *Pros:* (almost) all
> > dependency resolution comes from one place. *Cons:* native-toolchain
> > would change very frequently as new releases happen.
> >
> > 2. Don't provide any built-in mechanism for starting a test environment.
> > If you want to test Impala - set up your own Hadoop cluster instance.
> > *Pros:* removes a lot of complexity. *Cons:* pushes a lot of work onto the
> > user, makes it harder to run self-contained tests.
> >
> > 3. Have a separate test-dependencies repository that does basically the
> > same thing as the toolchain. *Pros:* separates out fast-moving
> > dependencies from slow-moving ones. *Cons:* more moving parts. HDFS would
> > need to be in both repositories (as libhdfs is a compile-time dependency
> > for the backend).
> >
> > My preference is for option #1. We can do something like the following:
> >
> > * Add a cmake target to 'build' a test environment (resolve test
> > dependencies, start mini-cluster using checked-in scripts)
> > * Add scripts to native-toolchain to download tarballs for HBase, HDFS,
> > Hive and others just like compile-time dependencies. Update Impala's
> > CMake scripts to use the local toolchain directory to find binaries,
> > management scripts etc.
> > * During each upstream release, add any new dependencies to
> > native-toolchain, and update impala.git/bin/impala-config.sh with the new
> > version numbers.
> >
> > What does everyone think?
> >
>
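
As a rough illustration of the option #1 steps quoted above (pin versions in one place, fetch tarballs into a local toolchain directory), the sketch below only plans the downloads and does not touch the network. The function names, the URL layout, the tarball naming scheme, and the directory layout are all assumptions; they are not how native-toolchain or impala-config.sh actually work.

```python
import os


def tarball_name(component, version):
    """Hypothetical naming scheme for a released tarball."""
    return f"{component}-{version}.tar.gz"


def plan_downloads(versions, base_url, toolchain_dir):
    """Map pinned versions to (url, destination) pairs.

    versions:      dict of component name -> pinned version string
    base_url:      root of a (hypothetical) artifact server
    toolchain_dir: local directory where the build looks for binaries
    """
    plan = []
    for component, version in sorted(versions.items()):
        name = tarball_name(component, version)
        plan.append({
            "component": component,
            "url": f"{base_url.rstrip('/')}/{component}/{name}",
            "dest": os.path.join(toolchain_dir, f"{component}-{version}"),
        })
    return plan
```

With a layout like this, an upstream build and a downstream vendor build would differ only in the version map and base URL they pass in, which is the swap the original proposal asks for.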
