hadoop-general mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: additional source only release tarball
Date Thu, 25 Feb 2010 13:20:41 GMT
Arun C Murthy wrote:
>  This seems reasonable, anyone else?
>  Thomas, could you, please, open a jira and attach your patch to 
> build.xml there? I have some comments on some of your exclude patterns, 
> but we can have that discussion on the jira.
> thanks,
> Arun

One thing I've discussed before is the whole notion of having some 
hadoop-redist project whose aim is to produce useful 
packaging/installation of the artifacts of the main Hadoop projects.

  -stuff to help the various CM tools work w/ Hadoop
  -things to help us deploy and test at scale

This may seem trivial, but it isn't. Right now my test process is
  * create my own RPMs
  * scp them to a machine with some scripts (that I don't run) to
  * ssh to that machine, run some scripts and create a VM disk image 
containing the RPMs and the installer scripts
  * bring up the controller front end to the infrastructure

Then I run a junit test
  * Have htmlunit start talking to a web ui
  * through it, ask for 20 VMs
  * go get my 16th coffee of the day, assuming this is test run #14 
(first two coffees are at home, one before the school run, the other 
while doing the first round of emails)
  * have the controller wait for the VMs to come up
  * as they do, for each one, run a thread that SSHs in  and actually 
pushes out the Hadoop config based on the hostnames given to the 
different VMs
  * run terasort
  * grab the logs
  * teardown
  * merge the logs in with the rest of the junit output
  * try and work out why the test failed

oh, and everything from the 20 VMs onwards is started by the junit test

To make life more fun, that bit of [1] has to spin waiting for the DNS 
names to resolve, the machines to boot enough for SSH to work, then SSH 
in and spin there waiting for the installed programs to be on the path 
and then the CM tooling to be live. And feed the log of this not just to 
the local log4j log but (somehow) stream it back over the (remote) web 
ui that has to then live update the web page (this is surprisingly hard)
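
For anyone who hasn't had to write this sort of thing, the spin-waiting is roughly the shape below. This is a minimal sketch and not the real code: the class SpinWait, its method names, the attempt counts and the assumption that SSH is on port 22 are all made up for illustration. It only covers the "DNS resolves, then the SSH port answers" part, not the waiting for installed programs or the CM tooling.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: poll until a freshly created VM is reachable. */
public class SpinWait {

    /** Polls a condition until it holds or the attempts run out. */
    static boolean waitFor(BooleanSupplier condition, int attempts, long sleepMillis) {
        for (int i = 0; i < attempts; i++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // restore the interrupt flag and give up early
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    /** True once the hostname resolves. */
    static boolean resolves(String hostname) {
        try {
            InetAddress.getByName(hostname);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    /** True once something accepts TCP connections on the given port. */
    static boolean portOpen(String hostname, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(hostname, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Waits for DNS first, then for the (assumed) SSH port 22 to open. */
    static boolean waitForSsh(String hostname, int attempts, long sleepMillis) {
        return waitFor(() -> resolves(hostname), attempts, sleepMillis)
            && waitFor(() -> portOpen(hostname, 22, 2000), attempts, sleepMillis);
    }
}
```

An open port only says sshd is listening; it says nothing about whether the installed programs are on the path yet, which is why the remote commands have to do their own retrying.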

Here's a bit of it, excluding the coffees

           PortUtils.checkPort(hostname, factory.getPort(), 

	//some stuff creating SSH sessions, SCP-ing over files skipped

             //make a few attempts to find the startup command
             for (int i = 0; i < STARTUP_LOCATE_ATTEMPTS; i++) {
                 commandsList.add("which " + SF_START + " || sleep " + 
             commandsList.add("which " + SF_START + " || echo " + 
             String sfPing = makeSFCommand(SF_PING) + " " + "localhost";
             for (int i = 0; i < STARTUP_PING_ATTEMPTS; i++) {
                 commandsList.add(sfPing + " || sleep " + 
             commandsList.add("sleep " + SLEEP_TIME);
             if (!factory.isKeepFiles()) {
                 commandsList.add("rm " + desttempfile);
             sshExec(session, commandsList);

See that? fun. JUnit test cases that fail in interesting ways. Really 
interesting ways. Sometimes it's the infrastructure, sometimes it's my 
code. Sometimes, because I am working with SVN_HEAD, it's Hadoop. 
Problem is, it takes a while to track down, especially if the VMs are no 
longer there at the end of the run.

Right now one of the problems is that none of Hadoop's JSP pages work. I 
can submit jobs, talk to the FS, etc, but all the JSP pages are failing 
and the root of every web is /webapps ; I think it's something that 
Todd's also seeing, but I can't make it go away, not without fixing how 
Hadoop locates/registers webapps.

Returning to the idea of a deployment project, having more tests that 
can be run against production clusters would be good: all the tests 
should be redistributables that you can deploy/run. There is something 
needed to QA a cluster, and ideally identify the approximate area where 
problems lie (e.g. DNS playing up, rDNS not working, clocks out of sync, 


