hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "TestingNov2009" by SteveLoughran
Date Fri, 20 Nov 2009 14:46:44 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "TestingNov2009" page has been changed by SteveLoughran.
The comment on this change is: more on testing.
http://wiki.apache.org/hadoop/TestingNov2009?action=diff&rev1=1&rev2=2

--------------------------------------------------

  
  == Benchmarking ==
  
- One use case that comes up is stress testing clusters; to see the cluster supports Hadoop
"as well as it should", and trying to find out why it doesn't, if it is not adequate. What
we have today is [[Terasort]], where you have to guess the approximate numbers then run the
job. Terasort creates its own test data, which is good, but it doesn't stress the CPUs as
realistically as many workloads, and it generates lots of intermediate and final data; there
is no reduction.
+ One use case that comes up is stress testing clusters: seeing whether a cluster supports Hadoop
"as well as it should", and trying to find out why it doesn't if it is not adequate. What
we have today is TeraSort, where you have to guess the approximate numbers and then run the job.
TeraSort creates its own test data, which is good, but it doesn't stress the CPUs as realistically
as many workloads do, and it generates lots of intermediate and final data; there is no reduction
in data volume.
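+ 
+ As an illustration, a TeraSort run can be scripted from Java via the example classes shipped
in the Hadoop examples JAR. A minimal sketch follows, assuming the 0.20-era package name
org.apache.hadoop.examples.terasort; the row count is exactly the kind of number you have to
guess up front.
{{{
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.util.ToolRunner;

public class RunTeraSort {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // 10^9 rows of 100 bytes each, roughly 100GB; pick a size that
    // actually stresses the cluster under test
    ToolRunner.run(conf, new TeraGen(),
        new String[] {"1000000000", "/benchmarks/terasort-input"});
    ToolRunner.run(conf, new TeraSort(),
        new String[] {"/benchmarks/terasort-input", "/benchmarks/terasort-output"});
  }
}
}}}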
  
   * [[http://www.slideshare.net/steve_l/benchmarking-1840029 | Benchmarking slides]]
  
  == Basic Cluster Health Tests ==
  
- There are currently no tests that work with Hadoop via the web pages, no job submission
and monitoring. It is in fact possible to bring up a Hadoop cluster in which JSP doesn't work,
but the basic tests all appear well -even including TeraSort, provided you use the low-level
APIs
+ There are currently no tests that work with Hadoop via its web pages: no job submission
and monitoring through the GUI. It is in fact possible to bring up a Hadoop cluster in which
JSP doesn't work, yet have all the basic tests appear well -even including TeraSort- provided
you use the low-level APIs.
+ 
+ Options:
+  * Create a set of JUnit/HtmlUnit tests that exercise the GUI; design these to run against
any host. Either check out the source tree and run the tests against a remote cluster, or
package the tests in a JAR and make that a project distributable (a minimal sketch follows
below).
+  * We may need separate test JARs for HDFS and MapReduce.
+ 
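Such a test might look like the following sketch, which assumes the 0.20-era NameNode status
page dfshealth.jsp on port 50070; the property name namenode.web.url is hypothetical, used so
the same test can run against any cluster.
{{{
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class NameNodeWebUITest {
  // hypothetical property; the default assumes a local pseudo-cluster
  private final String baseUrl =
      System.getProperty("namenode.web.url", "http://localhost:50070");

  @Test
  public void testDfsHealthPageRenders() throws Exception {
    WebClient client = new WebClient();
    try {
      // if JSP is broken this fails, even when the low-level APIs work
      HtmlPage page = client.getPage(baseUrl + "/dfshealth.jsp");
      assertTrue("page did not render: " + page.asText(),
          page.asText().contains("Live Nodes"));
    } finally {
      client.closeAllWindows();
    }
  }
}
}}}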
  
  == Testing underlying platforms ==
  
  We need to test the underlying platforms, from the JVM and Linux distributions to any
Infrastructure-on-Demand APIs that provide VMs on demand -machines on which Hadoop can run.
  
+ === JVM Testing ===
+ 
+ This is an IBM need; the same tests can also be used to qualify new Sun releases. Any JVM
defect that stops Hadoop running at scale should be viewed as a blocking issue by all JVM
suppliers.
+ 
+  * Need to be able to install the latest JVM build, then run the stress tests (a sketch for
tying results to a specific JVM follows).
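+ 
+ A small sketch of that last step: record the JVM's identity in the test output via the
standard system properties, so a failing stress run can be attributed to a specific JVM build.
{{{
// print the identity of the JVM under test into the logs/report
public class JvmInfo {
  public static void main(String[] args) {
    System.out.println("java.vm.vendor  = " + System.getProperty("java.vm.vendor"));
    System.out.println("java.vm.version = " + System.getProperty("java.vm.version"));
    System.out.println("java.runtime.version = "
        + System.getProperty("java.runtime.version"));
    System.out.println("os.name/os.arch = "
        + System.getProperty("os.name") + "/" + System.getProperty("os.arch"));
  }
}
}}}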
+ 
+ === OS Testing ===
+ 
+ Test that Hadoop works on the target OS. If Hadoop is packaged in an OS-specific format (e.g.
RPM), those installations need to be tested too.
+ 
+  * Need to be able to create new machine images (PXE, kickstart, etc.), then push Hadoop out
to the nodes and test the cluster.
+ 
+ === IaaS Testing ===
+ 
+ Hadoop can be used to stress test Infrastructure as a Service platforms, and Hadoop-on-IaaS
is offered as a service by some companies (Cloudera, Amazon EC2).
+ 
+ Hadoop can be used on Eucalyptus installations using the EC2 client libraries. This can show
up problems with Eucalyptus (different fault messages compared to EC2, time zone/clock
differences).
+ 
+ Other infrastructures will have different APIs, with different features (private subnets,
machine restart and persistence).
+ 
+  * Need to be able to work with different infrastructures and unstable APIs. 
+  * Testing on EC2 runs up rapid bills if you create/destroy machines for every JUnit test
method, or even for every test run. It is best to create a small pool of machines at the start
of the working day and release them in the evening, and to have build file targets that destroy
all of a developer's machines -and to run those targets at night as part of the CI build.
+  * Troubleshooting on IaaS platforms can be interesting, as the VMs get destroyed -the test
runner needs to capture the (relevant) local log data before teardown.
+  * SSH is the primary way to communicate with the (long-haul) cluster, even from a developer's
local machine.
+  * It is important not to embed private data -keys, logins- in build files or test reports.
+  * For testing local Hadoop builds on IaaS platforms, the build process needs to scp over
and install the Hadoop binaries and the configuration files. This can be done by creating
a new disk image that is then used to bootstrap every node, or by starting with a clean base
image and copying in Hadoop on demand. The latter is much more agile and cost effective during
iterative development, but doesn't scale to very large clusters (1000s of machines) unless
you delegate the copy/install task to the first few tens of allocated machines. For EC2,
one tactic is to upload the binaries to S3 and have scripts on the nodes copy down and
install the files (a sketch of the push-over-SSH step follows).
+ 
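A minimal sketch of that push-over-SSH step, using JSch (an assumption: any SSH library, or
plain ssh/scp from a build file, would work equally well). The host name, user and key file
are hypothetical, and the scp of the tarball itself is elided.
{{{
import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class PushHadoopBuild {
  public static void main(String[] args) throws Exception {
    JSch jsch = new JSch();
    jsch.addIdentity("/home/dev/.ssh/cluster_key"); // keep keys out of build files
    Session session = jsch.getSession("hadoop", "node1.example.com", 22);
    session.setConfig("StrictHostKeyChecking", "no"); // test clusters only
    session.connect();
    try {
      // assumes the tarball has already been scp'd to /tmp on the node
      ChannelExec exec = (ChannelExec) session.openChannel("exec");
      exec.setCommand("tar -C /opt -xzf /tmp/hadoop-build.tar.gz");
      exec.connect();
      while (!exec.isClosed()) {
        Thread.sleep(100); // crude poll for command completion
      }
      exec.disconnect();
    } finally {
      session.disconnect();
    }
  }
}
}}}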
+ 
  == Exploring the Hadoop Configuration Space ==
  
- There are a lot of Hadoop configuration options, even ignoring those of the underlying machines
and network.
+ There are a lot of Hadoop configuration options, even ignoring those of the underlying machines
and network. For example, what impact do block size and replication factor have on your workload?
+ 
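A minimal sketch of sweeping that corner of the configuration space, submitting the same
(elided) job under each combination; the property names are the 0.20-era ones.
{{{
import org.apache.hadoop.mapred.JobConf;

public class ConfigSweep {
  public static void main(String[] args) throws Exception {
    long[] blockSizes = {64L << 20, 128L << 20, 256L << 20}; // bytes
    int[] replications = {1, 2, 3};
    for (long blockSize : blockSizes) {
      for (int replication : replications) {
        JobConf conf = new JobConf(ConfigSweep.class);
        conf.setLong("dfs.block.size", blockSize);
        conf.setInt("dfs.replication", replication);
        // ... set mapper/reducer and input/output paths, then e.g.
        // JobClient.runJob(conf); and record the elapsed time ...
      }
    }
  }
}
}}}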
  
  == Testing applications that run on Hadoop ==
  
@@ -43, +77 @@

  
  This is a problem for Cloudera and others who distribute, internally package, and deploy
Hadoop: you need to know that your RPMs or other redistributables work.
  
- It's similar to the cluster acceptance test problem, except that you need to create the
distribution packages and install them on the remote machines, then run the tests.
+ It's similar to the cluster acceptance test problem, except that you need to create the
distribution packages and install them on the remote machines before running the tests. The
testing-over-IaaS use cases above are the closest match.
  
+  * Testing RPM upgrades from many past versions is tricky.
+ 
+ == Simulating Cluster Failures ==
+ 
+ Cluster failure handling -especially the loss of large portions of a large datacenter- is
something that is not currently formally tested.
+ 
+  * Network failures can be simulated on some IaaS platforms.
+  * Forcibly killing processes is a more realistic approach that works on most platforms,
though it is hard to choreograph (a minimal sketch follows).
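+ 
+ A minimal sketch of the process-killing approach: pick a random worker and kill its DataNode
over SSH by shelling out with ProcessBuilder. The host list and the pkill pattern are
assumptions; the hard parts -choosing when to kill, and verifying that the cluster recovers-
are not shown.
{{{
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class KillRandomDataNode {
  public static void main(String[] args) throws Exception {
    List<String> workers = Arrays.asList("node1", "node2", "node3"); // hypothetical hosts
    String victim = workers.get(new Random().nextInt(workers.size()));
    // kill -9 gives the daemon no chance to clean up, like a real crash
    Process p = new ProcessBuilder("ssh", victim,
        "pkill -9 -f org.apache.hadoop.hdfs.server.datanode.DataNode").start();
    int rc = p.waitFor();
    System.out.println("killed DataNode on " + victim + " (ssh exit code " + rc + ")");
  }
}
}}}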
+ 
