hadoop-common-dev mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Date Thu, 22 Nov 2012 09:14:19 GMT
On 21 November 2012 19:15, Matt Foley <mattf@apache.org> wrote:

> This discussion started in
> Those of us involved in the branch-1-win port of Hadoop to Windows without
> use of Cygwin, have faced the issue of frequent use of shell scripts
> throughout the system, both in build time (eg, the utility
> "saveVersion.sh"),
> and run time (config files like "hadoop-env.sh" and the start/stop scripts
> in "bin/*" ).  Similar usages exist throughout the Hadoop stack, in all
> projects.
> The vast majority of these shell scripts do not do anything platform
> specific; they can be expressed in a posix-conforming way.  Therefore, it
> seems to us that it makes sense to start using a cross-platform scripting
> language, such as python, in place of shell for these purposes.  For those
> rare occasions where platform-specific functionality really is needed,
> python also supports quite a lot of platform-specific functionality on both
> Linux and Windows; but where that is inadequate, one could still
> conditionally invoke a platform-specific module written in shell (for
> Linux/*nix) or powershell or bat (for Windows).
> The primary motive for moving to a cross-platform scripting language is
> maintainability.  The alternative would be to maintain two complete suites
> of scripts, one for Linux and one for Windows (and perhaps others in the
> future).  We want to avoid the need to update dual modules in two different
> languages when functionality changes, especially given that many Linux
> developers are not familiar with powershell or bat, and many Windows
> developers are not familiar with shell or bash.
I'd argue that a lot of Hadoop java developers aren't that familiar with
bash. It's only in the last six months that I've come to hate it properly.

In the ant project, it was the launcher scripts that had the worst
bug-report-to-line ratio, because of
 -variations in .sh behaviour, especially under cygwin, but also things
that weren't bash (AIX, ...)
 -requirements of the entire unix command set for real work
 -variants in the parameters/behaviour of those commands between Linux and
other widely used Unix systems (e.g. OSX)
 -lack of inclusion of the .sh scripts in the junit test suite
 -lack of understanding of bash.

In the ant project we added a Python launcher in, what, 2001, based on the
Perl launcher supplied by one steve_l@users.sourceforge

> For run-time, there is likely to be a lot more discussion.  Lots of folks,
> including me, aren't real happy with use of active scripts for
> configuration, and various others, including I believe some of the Bigtop
> folks, have issues with the way the start/stop scripts work.  Nevertheless,
> all those scripts exist today and are widely used.  And they present an
> impediment to porting to Windows-without-cygwin.

They're a maintenance and support cost on Unix too. There are too many
scripts, even more in Yarn; weakly-nondeterministic logic for loading env
variables, especially between init.d and bin/hadoop; and not many
diagnostics. And as with Ant, it's a relatively under-comprehended language
with no unit test coverage.

I'd replace the bash logic with python for Unix dev and maintenance alone.
You could put your logic into a shared python module in usr/lib/hadoop/bin,
and have PyUnit test the inner functions as part of the build and test
process (& jenkins).

> Nothing about run-time use of scripts has changed significantly over the
> past three years, and I don't think we should hold up the Windows port
> while we have a huge discussion about issues that veer dangerously into
> religious/aesthetic domains. It would be fun to have that discussion, but I
> don't want this decision to be dependent on it!
With Yarn it's got more complex. More env variables to set, more support
calls when they aren't.
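
As a sketch of the sort of diagnostics that would cut down those support
calls: resolve each variable in one fixed order and record where it came
from, so a missing or surprising value can be reported up front. The
helper resolve_env and its precedence order (environment first, then
defaults) are assumptions for illustration, not how the current scripts
behave.

```python
import os


def resolve_env(names, defaults=None, environ=None):
    """Resolve variables in a fixed order and record their origin.

    Returns (resolved, missing): resolved maps each name to a
    (value, source) pair, and missing lists names found nowhere.
    The lookup order is an illustrative assumption.
    """
    environ = os.environ if environ is None else environ
    defaults = defaults or {}
    resolved, missing = {}, []
    for name in names:
        if name in environ:
            resolved[name] = (environ[name], "environment")
        elif name in defaults:
            resolved[name] = (defaults[name], "default")
        else:
            missing.append(name)
    return resolved, missing
```

A launcher could print the resolved table in a debug mode and fail early
with the list of missing names, rather than dying later with an obscure
error.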

> So I propose that we go ahead and also approve python as a run-time
> dependency, and allow the inclusion of python scripts in place of current
> shell-based functionality.  The unpleasant alternative is to spawn a bunch
> of powershell scripts in parallel to the current shell scripts, with a very
> negative impact on maintainability.  The Windows port must, after all, be
> allowed to proceed.
+1 to any vote to allow .py at run time as a new feature

=0 to ripping out and replacing the existing .sh scripts with python code,
as even though I don't like the scripts, replacing them could be traumatic

+1 to a gradual migration to .py for new code, starting with the yarn
