hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "TestingNov2009" by SteveLoughran
Date Fri, 20 Nov 2009 16:43:46 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "TestingNov2009" page has been changed by SteveLoughran.
The comment on this change is: Extra issues and past work.
http://wiki.apache.org/hadoop/TestingNov2009?action=diff&rev1=3&rev2=4

--------------------------------------------------

  Test Hadoop working on the target OS. If Hadoop is packaged in an OS specific format (e.g.
RPM), those installations need to be tested. 
  
   * Need to be able to create new machine images (PXE, kickstart, etc.), then push out Hadoop
to the nodes and test the cluster.
+  * Cluster setup times can be significant if you have to reboot and re-image physical machines.
  
  === IaaS Testing ===
  
@@ -53, +54 @@

  Other infrastructures will have different APIs, with different features (private subnets,
machine restart and persistence)
  
   * Need to be able to work with different infrastructures and unstable APIs. 
+  * Machine allocation/release adds a significant delay to every test case that creates new machines
   * Testing on EC2 runs up bills rapidly if you create/destroy machines for every JUnit test method,
or even every test run. It is best to create a small pool of machines at the start of the working
day and release them in the evening, and to have build file targets that destroy all of a developer's
machines -and to run them at night as part of the CI build.
   * Troubleshooting on IaaS platforms can be interesting as the VMs get destroyed -the test
runner needs to capture (relevant) local log data.
   * SSH is the primary way to communicate with the (long-haul) cluster, even from a developer's
local machine.
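
A minimal sketch of the "small pool of machines" idea above, assuming the IaaS API can be wrapped in two callbacks. The names `MachinePool`, `allocate`, `release` and `fake_allocate` are hypothetical and not part of Hadoop or any IaaS toolkit; the point is only that tests lease machines from a pool rather than creating and destroying them per test.

```python
from contextlib import contextmanager


class MachinePool:
    """Reuses a fixed set of machines across tests instead of allocating per test."""

    def __init__(self, allocate, release, size):
        # allocate() and release(machine) are hypothetical callbacks that
        # wrap whatever IaaS API is in use.
        self._release = release
        self._idle = [allocate() for _ in range(size)]
        self._busy = []

    @contextmanager
    def lease(self):
        """Hand an idle machine to one test, then return it to the pool."""
        if not self._idle:
            raise RuntimeError("no idle machines in the pool")
        machine = self._idle.pop()
        self._busy.append(machine)
        try:
            yield machine
        finally:
            self._busy.remove(machine)
            self._idle.append(machine)

    def destroy_all(self):
        """Release every machine; intended for the evening or nightly CI target."""
        for machine in self._idle + self._busy:
            self._release(machine)
        self._idle.clear()
        self._busy.clear()


# Stand-in allocator so the sketch runs without any real infrastructure.
allocated, released, used = [], [], []


def fake_allocate():
    name = "vm-%d" % len(allocated)
    allocated.append(name)
    return name


pool = MachinePool(fake_allocate, released.append, size=2)
for _ in range(5):  # five "tests" share the same two machines
    with pool.lease() as machine:
        used.append(machine)
pool.destroy_all()
```

Here five leases are served by two allocations, and `destroy_all()` is the call a "destroy all of a developer's machines" build target would make.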
@@ -63, +65 @@

  
  == Exploring the Hadoop Configuration Space ==
  
- There are a lot of Hadoop configuration options, even ignoring those of the underlying machines
and network. For example, what impact does blocksize and replication factor have on your workload.
+ There are a lot of Hadoop configuration options, even ignoring those of the underlying machines
and network. For example, what impact does blocksize and replication factor have on your workload?
What different network card configuration parameters give the best performance? Which combinations
of options break things?
  
+ When combined with IaaS platforms, the configuration space gets even larger.
+ 
+ Manually exploring the configuration space takes too long; currently everyone tries to stick
close to the Yahoo! configurations, which are believed to work -whenever someone strays from
them, interesting things happen. For example, setting a replication factor of only 2 found a
duplication bug; running Hadoop on a machine that isn't quite sure of its hostname shows up
other assumptions as things you cannot rely on. 
+ 
+  * There is existing work on automated configuration testing, notably the work done by Adam
Porter and colleagues on [[http://www.youtube.com/watch?v=r0nn40O3mCY | Distributed Continuous
Quality Assurance]]
+  * (Steve says) at HP we've used a pseudo-RNG to drive transforms to the infrastructure
and deployed applications; this explores some of the space and is somewhat replicable.
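
One way to picture the pseudo-RNG approach is a seeded sampler over a table of options. This is only a sketch under assumptions, not the HP tooling mentioned above: `OPTION_SPACE` and `sample_configurations` are hypothetical names, and the value lists for `dfs.replication` and `dfs.block.size` are illustrative.

```python
import random

# Illustrative option values only; a real sweep would use its own lists.
OPTION_SPACE = {
    "dfs.replication": [1, 2, 3],
    "dfs.block.size": [64 * 1024 * 1024, 128 * 1024 * 1024, 256 * 1024 * 1024],
}


def sample_configurations(seed, count, space=OPTION_SPACE):
    """Return `count` pseudo-random configurations; the same seed gives the same list."""
    rng = random.Random(seed)  # private generator, unaffected by global random state
    configs = []
    for _ in range(count):
        # Visit option names in sorted order so the draw sequence
        # does not depend on how the dictionary was built.
        configs.append({name: rng.choice(space[name]) for name in sorted(space)})
    return configs


first_run = sample_configurations(seed=42, count=5)
second_run = sample_configurations(seed=42, count=5)
```

Recording the seed alongside a failing test run is what makes the exploration "somewhat replicable": rerunning with the same seed on the same Python version regenerates the same configurations.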
  
  == Testing applications that run on Hadoop ==
  
@@ -84, +92 @@

  
  == Simulating Cluster Failures ==
  
- Cluster failure handling -especially the loss of large portions of a large datacenter, is
something that is not currently formally tested. 
+ Cluster failure handling -especially the loss of large portions of a large datacenter, is
something that is not currently formally tested. There are bug fixes that go into Hadoop to
test some of this, but loss of a quarter of the datanodes is a disaster that doesn't get tested
at scale before a release is made.
  
-  * Network failures can be simulated on some IaaS platforms
+  * Network failures can be simulated on some IaaS platforms just by breaking a virtual link
   * Forcibly killing processes is a more realistic approach which works on most platforms,
though it is hard to choreograph
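
Part of the choreography problem is deciding, repeatably, which processes to kill. The sketch below only computes such a plan from a seed; it does not kill anything, and how the kill is delivered (for example over SSH) is left out. `plan_datanode_kills` and the host names are hypothetical.

```python
import random


def plan_datanode_kills(datanodes, fraction, seed):
    """Pick which datanodes to kill; the same seed reproduces the same plan."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    victims = int(len(datanodes) * fraction)
    # Sort first so the plan does not depend on the order the hosts were listed in.
    return rng.sample(sorted(datanodes), victims)


# Example: lose a quarter of a 16-node cluster.
nodes = ["datanode%02d" % i for i in range(16)]
plan = plan_datanode_kills(nodes, 0.25, seed=2009)
```

With the plan fixed up front, a failed run can be repeated against the same set of victims instead of a fresh random selection.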
  
