hadoop-common-dev mailing list archives

From Chris Nauroth <cnaur...@hortonworks.com>
Subject Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Date Wed, 21 Nov 2012 21:03:23 GMT
I worked on some of the Python build scripting that currently resides in
branch-trunk-win.  Initially, my goal was to keep a "pure" Maven
implementation to the greatest degree possible without external scripting,
but I encountered a few problems:

1. One approach is to try to express all of the build logic with existing
Maven plugins.  This turned out to be infeasible in some cases.  I don't
know of an existing plugin that does anything like the logic in
saveVersion.sh/.py for walking the source tree and checksumming the files.
For protoc, I saw a proposed plugin in open source, but it hadn't reached
release status yet.  For creation of the distribution tarballs, the Maven
Ant Plugin (and actually the underlying Ant tool) cannot preserve file
permissions or symlinks.

2. Considering that the first approach isn't possible, another possibility
is to write custom Maven plugins.  This would require significantly more
engineering time to write and test the code.  I think there are some
legitimate concerns too about supportability, because this approach would
put significant build logic into Maven plugin code instead of something
more easily visible to release engineers, like pom.xml and external
scripts.  Also, I'm actually not sure that we can implement everything with
a Maven plugin.  For example, I mentioned the problem of preserving file
permissions and symlinks in the distribution tarballs.  Ant hasn't been
able to fix that problem due to a Java limitation, so our Maven plugins
coded in Java (or another JVM language) likely would suffer the same fate.
We might be stuck with some amount of external scripting no matter what.
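
As a rough illustration of the two tasks mentioned above (checksumming the
source tree, and building a distribution tarball that keeps file permissions
and symlinks), here is a minimal Python sketch. It is not the actual
saveVersion.py or the real packaging script. The function names
checksum_tree and make_dist_tarball, the ".java" filter, and the choice of
MD5 are assumptions made only for this example.

```python
import hashlib
import os
import tarfile


def checksum_tree(root, suffix=".java"):
    """Return one MD5 hex digest covering every file under root that ends in suffix.

    Hypothetical helper: the file filter and digest layout are assumptions,
    not a copy of what saveVersion.sh/.py computes.
    """
    rel_paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                full = os.path.join(dirpath, name)
                # Normalize separators so Linux and Windows agree on the order.
                rel_paths.append(os.path.relpath(full, root).replace(os.sep, "/"))

    combined = hashlib.md5()
    # Sort so the result does not depend on directory listing order.
    for rel in sorted(rel_paths):
        with open(os.path.join(root, rel), "rb") as handle:
            file_digest = hashlib.md5(handle.read()).hexdigest()
        combined.update(("%s  %s\n" % (file_digest, rel)).encode("utf-8"))
    return combined.hexdigest()


def make_dist_tarball(src_dir, tar_path):
    """Pack src_dir into a gzipped tarball, keeping mode bits and symlinks."""
    top = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(tar_path, "w:gz") as tar:
        # tarfile stores symlinks as links by default (dereference=False)
        # and records the permission bits of each member.
        tar.add(src_dir, arcname=top)
```

Calling checksum_tree("src") twice on an unchanged tree gives the same digest,
and make_dist_tarball("target/dist", "dist.tar.gz") writes an archive whose
members keep their executable bits and symlink targets on filesystems that
have them. Both paths in that usage are placeholders.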

Thank you,
--Chris


On Wed, Nov 21, 2012 at 12:00 PM, Konstantin Boudnik <cos@apache.org> wrote:

> I like Alejandro's idea about Maven for a few reasons:
>   - bringing in a scripting environment which is known for its
>     inter-version idiosyncrasies just because Windows can't handle
>     trivial shell scripting looks like overkill to me
>   - related to the above, there's a chance that Python's prerequisites
>     used in Hadoop might conflict with some other components in the
>     stack. This will be a nightmare for integrator projects, e.g. Bigtop
>   - Maven is the de facto standard for Java stacks
>   - Maven has a built-in scripting language (Groovy) if some plugins
>     aren't sufficient for achieving whatever goals
>
> Addressing Matt's later point about the non-Mavenized Hadoop-1 line: it
> uses Maven stuff such as deploy/install via custom Ant tasks. The same
> approach would work for saveVersion.sh and others, I am sure.
>
> Cos
>
> On Wed, Nov 21, 2012 at 11:25AM, Alejandro Abdelnur wrote:
> > Hey Matt,
> >
> > We already require java/mvn/protoc/cmake/forrest (forrest is hopefully on
> > its way out with the move of docs to APT)
> >
> > Why not do a maven-plugin to do that?
> >
> > Colin already has something to simplify all the cmake calls from the
> > builds using a maven-plugin
> > (https://issues.apache.org/jira/browse/HADOOP-8887)
> >
> > We could do the same with protoc, thus simplifying the POMs.
> >
> > The saveVersion.sh seems like another prime candidate for a maven plugin,
> > and in this case it would not require external tools.
> >
> > Does this make sense?
> >
> > Thx
> >
> > On Wed, Nov 21, 2012 at 11:15 AM, Matt Foley <mattf@apache.org> wrote:
> >
> > > This discussion started in
> > > HADOOP-8924 <https://issues.apache.org/jira/browse/HADOOP-8924>,
> > > where it was proposed to replace the build-time utility
> > > "saveVersion.sh" with a Python script.  This would require Python as a
> > > build-time dependency.  Here's the background:
> > >
> > > Those of us involved in the branch-1-win port of Hadoop to Windows
> > > without use of Cygwin have faced the issue of frequent use of shell
> > > scripts throughout the system, both at build time (e.g., the utility
> > > "saveVersion.sh") and at run time (config files like "hadoop-env.sh"
> > > and the start/stop scripts in "bin/*").  Similar usages exist
> > > throughout the Hadoop stack, in all projects.
> > >
> > > The vast majority of these shell scripts do not do anything platform
> > > specific; they can be expressed in a POSIX-conforming way.  Therefore,
> > > it seems to us that it makes sense to start using a cross-platform
> > > scripting language, such as Python, in place of shell for these
> > > purposes.  For those rare occasions where platform-specific
> > > functionality really is needed, Python also supports quite a lot of
> > > platform-specific functionality on both Linux and Windows; but where
> > > that is inadequate, one could still conditionally invoke a
> > > platform-specific module written in shell (for Linux/*nix) or
> > > PowerShell or bat (for Windows).
> > >
> > > The primary motive for moving to a cross-platform scripting language is
> > > maintainability.  The alternative would be to maintain two complete
> > > suites of scripts, one for Linux and one for Windows (and perhaps
> > > others in the future).  We want to avoid the need to update dual
> > > modules in two different languages when functionality changes,
> > > especially given that many Linux developers are not familiar with
> > > PowerShell or bat, and many Windows developers are not familiar with
> > > shell or bash.
> > >
> > > Regarding the choice of Python:
> > >
> > >    - There are already a few instances of Python usage in Hadoop, such
> > >    as the utility (currently broken) "relnotes.py", and massive usage
> > >    of Python in the examples/ and contrib/ directories.
> > >    - Python is also used in Bigtop at build time.
> > >    - The Python language is available for free on essentially all
> > >    platforms, under an Apache-compatible license
> > >    <http://www.apache.org/legal/resolved.html>.
> > >    - It is supported in Eclipse and similar IDEs.
> > >    - Most importantly, it is widely accepted as a reasonably good OO
> > >    scripting language, and it is easily learned by anyone who already
> > >    knows shell or Perl, or other common scripting languages.
> > >    - On the Tiobe index of programming language popularity
> > >    <http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html>,
> > >    which seeks to measure the relative number of software engineers
> > >    who know and use each language, Python far exceeds Perl and Ruby.
> > >    The only more well-known scripting languages are PHP and Visual
> > >    Basic, neither of which seems a prime candidate for this use.
> > >
> > > For build-time usage, I think we should immediately approve Python as a
> > > build-time dependency, and allow people who are motivated to do so to
> > > open jiras for migrating existing build-time shell scripts to Python.
> > >
> > > For run-time, there is likely to be a lot more discussion.  Lots of
> > > folks, including me, aren't real happy with use of active scripts for
> > > configuration, and various others, including I believe some of the
> > > Bigtop folks, have issues with the way the start/stop scripts work.
> > > Nevertheless, all those scripts exist today and are widely used.  And
> > > they present an impediment to porting to Windows-without-cygwin.
> > >
> > > Nothing about run-time use of scripts has changed significantly over
> > > the past three years, and I don't think we should hold up the Windows
> > > port while we have a huge discussion about issues that veer
> > > dangerously into religious/aesthetic domains. It would be fun to have
> > > that discussion, but I don't want this decision to be dependent on it!
> > >
> > > So I propose that we go ahead and also approve Python as a run-time
> > > dependency, and allow the inclusion of Python scripts in place of
> > > current shell-based functionality.  The unpleasant alternative is to
> > > spawn a bunch of PowerShell scripts in parallel to the current shell
> > > scripts, with a very negative impact on maintainability.  The Windows
> > > port must, after all, be allowed to proceed.
> > >
> > > Let's have a discussion, and then I'll put both issues, separately, to
> > > a vote (unless we miraculously achieve consensus without a vote :-)
> > >
> > > I also encourage members of the other Hadoop-related projects to carry
> > > this discussion into those forums.  It would be very cool to agree on a
> > > whole-stack solution for the scripting problem.
> > >
> > > Best regards,
> > > --Matt
> > >
> >
> >
> >
> > --
> > Alejandro
>
