impala-dev mailing list archives

From Henry Robinson <he...@cloudera.com>
Subject Re: Getting rid of thirdparty
Date Thu, 10 Mar 2016 19:10:09 GMT
I didn't think that binaries were uploaded to any repository, but instead
to S3 (and therefore there's no version history) or some other URL. That's
what I'd suggest we continue to do.
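
For illustration only, a rough sketch of what fetching a pre-built binary from a
versioned S3/HTTP location might look like (the bucket name, layout and component
names below are placeholders, not actual project locations). The point is that the
version is baked into the artifact name, so git doesn't need to carry any history
for it:

    #!/usr/bin/env python
    # Sketch only: download a dependency tarball from a versioned URL.
    # BASE_URL and the naming scheme are made up for illustration.
    import os
    try:
        from urllib.request import urlretrieve  # Python 3
    except ImportError:
        from urllib import urlretrieve           # Python 2

    BASE_URL = "https://example-bucket.s3.amazonaws.com/impala-deps"

    def fetch(name, version, dest_dir="."):
        """Download <name>-<version>.tar.gz unless it is already present."""
        tarball = "%s-%s.tar.gz" % (name, version)
        dest = os.path.join(dest_dir, tarball)
        if not os.path.exists(dest):
            urlretrieve("%s/%s/%s" % (BASE_URL, name, tarball), dest)
        return dest

    if __name__ == "__main__":
        fetch("avro-c", "1.7.4")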

Cloudera and the Apache Impala project should do what's best for them,
independently. I bet Cloudera can fork the native-toolchain repository and
set the dependency versions as desired. Then the dependencies can be
uploaded to a Cloudera-specific location.

Maven would also be an ok route to explore - are start / stop scripts etc.
routinely checked into Maven by other projects? The nice thing about the
toolchain is that we can usually rely on a longer lifetime for published
artifacts (in my experience, dependencies can come and go with Maven).
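
If we did go the Maven route, resolving a test-only artifact out of a repository
could be as simple as the dependency plugin's "get" goal. The sketch below is only
an illustration - the coordinates and repository URL are placeholders, and whether
the service start/stop scripts are actually published this way is exactly the open
question:

    #!/usr/bin/env python
    # Sketch only: resolve an artifact from a Maven repository into the local
    # ~/.m2 cache via the dependency plugin's "get" goal. The coordinates and
    # repository URL are placeholders.
    import subprocess

    def maven_get(coordinates, repo_url=None):
        cmd = ["mvn", "dependency:get", "-Dartifact=%s" % coordinates]
        if repo_url:
            cmd.append("-DremoteRepositories=%s" % repo_url)
        subprocess.check_call(cmd)

    if __name__ == "__main__":
        # e.g. an HBase test jar (version and classifier are placeholders)
        maven_get("org.apache.hbase:hbase-server:1.1.0:jar:tests")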

On 10 March 2016 at 11:03, Casey Ching <casey@cloudera.com> wrote:

> I suspect we can actually run all the test services using the maven
> artifacts. Maybe we can investigate that?
>
> There’s not enough information about #1. How do updates work? The nice
> thing about the current setup is anyone can check out any commit and
> there’s a decent chance that checkout will build. Are we going to keep
> that ability? How does this work for Cloudera, Apache, and others? Are we
> going to upload all test binaries to the same repo?
>
>
> On March 10, 2016 at 10:52:07 AM, Jim Apple (jbapple@cloudera.com) wrote:
> Both #1 and #3 seem reasonable to me. I think #2 should be avoided because
> the Con you listed will, I think, make contributing to Impala difficult
> for new contributors, and I think that's more serious than the Cons for
> #1 and #3.
>
> On Thu, Mar 10, 2016 at 10:38 AM, Henry Robinson <henry@apache.org>
> wrote:
>
> > One of the tasks remaining before we can push Impala's code to the ASF's
> > git instance is to reduce the size of the repository. Right now even a
> > checkout of origin/cdh5-trunk is in the multi-GB range.
> >
> > The vast majority of that is in the thirdparty/ directory, which adds up
> > over the git history to be pretty huge with all the various versions we've
> > checked in. Removing it shrinks cdh5-trunk to ~200MB. So I propose we get
> > rid of thirdparty/ altogether.
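
Purely to illustrate the mechanics - not necessarily how the actual migration
would be scripted - stripping a directory from every commit could look like the
sketch below. It rewrites all commit hashes, so it would only ever be run on a
throwaway clone:

    #!/usr/bin/env python
    # Illustration only: drop thirdparty/ from every commit on every branch.
    # This rewrites history; run it on a disposable clone.
    import subprocess

    subprocess.check_call([
        "git", "filter-branch", "--prune-empty",
        "--index-filter", "git rm -r --cached --ignore-unmatch thirdparty",
        "--", "--all",
    ])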
> >
> > There are two main dependency types in thirdparty/. The first is a
> > compile-time C++ dependency like open-ldap or avro-c. These are (almost)
> > all superseded by the toolchain (see
> > https://github.com/cloudera/native-toolchain) build. A couple of
> > exceptions are Squeasel and Mustache which don't produce their own
> > libraries but are source files directly included in the Impala build. I
> > don't see a good reason we couldn't move those to the toolchain as well.
> >
> > The other kind of dependency is the test binaries that are used when we
> > start Impala's test environment (i.e. start the Hive metastore, HBase,
> > etc. etc.). These are trickier to extract (they're not just JARs, but
> > bin/hadoop etc. etc.). We also need to be able to change these
> > dependencies pretty efficiently - the upstream ASF repo should use
> > ASF-released artifacts here, but downstream vendors (like Cloudera) will
> > want to replace the ASF artifacts with their own releases.
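
One way to keep that swap cheap - this is a sketch only, and the variable name and
URL layout below are invented rather than anything either repo defines - is to
derive every download location from a single overridable base URL:

    #!/usr/bin/env python
    # Sketch only: let a downstream vendor redirect artifact downloads by
    # overriding one environment variable. DEPENDENCY_BASE_URL is a made-up
    # name; the ASF default and the URL layout are illustrative.
    import os

    DEFAULT_BASE = "https://archive.apache.org/dist"

    def tarball_url(component, version):
        base = os.environ.get("DEPENDENCY_BASE_URL", DEFAULT_BASE)
        # A vendor mirror would serve the same file names under its own base.
        return "%s/%s/%s-%s.tar.gz" % (base, component, component, version)

    if __name__ == "__main__":
        print(tarball_url("hadoop", "2.6.0"))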
> >
> > Note that the Java binaries in thirdparty/ are *not* the compile-time
> > dependencies for Impala's Java frontend - those are resolved via Maven.
> > It's a bad thing that there are two dependency resolution mechanisms,
> > but we might not be able to solve that issue right now.
> >
> > So what should we do with the test dependencies? I see the following
> > options:
> >
> > 1. Put them in the native-toolchain repository. *Pros:* (almost) all
> > dependency resolution comes from one place. *Cons:* native-toolchain
> > would change very frequently as new releases happen.
> >
> > 2. Don't provide any built-in mechanism for starting a test environment.
> > If you want to test Impala - set up your own Hadoop cluster instance.
> > *Pros:* removes a lot of complexity. *Cons:* pushes a lot of work onto
> > the user, makes it harder to run self-contained tests.
> >
> > 3. Have a separate test-dependencies repository that does basically the
> > same thing as the toolchain. *Pros:* separates out fast-moving
> > dependencies from slow-moving ones. *Cons:* more moving parts. HDFS
> > would need to be in both repositories (as libhdfs is a compile-time
> > dependency for the backend).
> >
> > My preference is for option #1. We can do something like the following:
> >
> > * Add a cmake target to 'build' a test environment (resolve test
> > dependencies, start mini-cluster using checked-in scripts)
> > * Add scripts to native-toolchain to download tarballs for HBase, HDFS,
> > Hive and others just like compile-time dependencies. Update Impala's
> > CMake scripts to use the local toolchain directory to find binaries,
> > management scripts etc. (see the sketch after this list).
> > * During each upstream release, add any new dependencies to
> > native-toolchain, and update impala.git/bin/impala-config.sh with the
> > new version numbers.
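
As a rough sketch of how the second and third steps might fit together -
everything here (variable names, default versions, the download URL) is an
assumption for illustration, not an existing script in either repository:

    #!/usr/bin/env python
    # Sketch only: download and unpack test-service tarballs into the local
    # toolchain directory, keyed off version variables of the kind
    # bin/impala-config.sh could export. Names, defaults and the base URL
    # are assumptions.
    import os
    import subprocess
    try:
        from urllib.request import urlretrieve  # Python 3
    except ImportError:
        from urllib import urlretrieve           # Python 2

    TOOLCHAIN_DIR = os.environ.get("IMPALA_TOOLCHAIN", "toolchain")
    BASE_URL = os.environ.get("TEST_DEPS_BASE_URL", "https://example.org/test-deps")

    # Component -> version, as impala-config.sh could export them.
    COMPONENTS = {
        "hadoop": os.environ.get("IMPALA_HADOOP_VERSION", "2.6.0"),
        "hbase": os.environ.get("IMPALA_HBASE_VERSION", "1.1.0"),
        "hive": os.environ.get("IMPALA_HIVE_VERSION", "1.2.1"),
    }

    def fetch_and_unpack(name, version):
        """Fetch <name>-<version>.tar.gz once and unpack it under the toolchain dir."""
        target = os.path.join(TOOLCHAIN_DIR, "%s-%s" % (name, version))
        if os.path.isdir(target):
            return target  # a version bump just means a new directory
        if not os.path.isdir(TOOLCHAIN_DIR):
            os.makedirs(TOOLCHAIN_DIR)
        tarball = "%s-%s.tar.gz" % (name, version)
        local = os.path.join(TOOLCHAIN_DIR, tarball)
        urlretrieve("%s/%s" % (BASE_URL, tarball), local)
        subprocess.check_call(["tar", "-xzf", local, "-C", TOOLCHAIN_DIR])
        return target

    if __name__ == "__main__":
        for component, version in COMPONENTS.items():
            fetch_and_unpack(component, version)

The cmake target from the first step would then amount to running something like
this before starting the checked-in mini-cluster scripts.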
> >
> > What does everyone think?
> >
>



-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679
