incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "MRUnitProposal" by esammer
Date Tue, 15 Feb 2011 19:04:59 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "MRUnitProposal" page has been changed by esammer.
http://wiki.apache.org/incubator/MRUnitProposal?action=diff&rev1=4&rev2=5

--------------------------------------------------

  ## page was copied from WhirrProposal
- = Whirr, a library of cloud services =
+ = MRUnit, a library to support unit testing of Hadoop MapReduce jobs =
+ 
+ '''This proposal is still incomplete (in progress) - esammer'''
  
  == Abstract ==
- Whirr will be a set of libraries for running cloud services.
+ MRUnit is a Java library that provides mocks and infrastructure for writing unit tests for
Hadoop MapReduce jobs and related components.
  
  == Proposal ==
- Whirr will provide code for running a variety of software services on cloud infrastructure.
It will provide bindings in several languages (e.g. Python and Java) for popular cloud providers
to make it easy to start and stop services like Hadoop clusters. The project will not be limited
to a particular set of services, rather it will be expected that a range of services are developed,
as determined by the project contributors. Possible services include Hadoop, HBase, !ZooKeeper,
Cassandra.
+ MRUnit is a Java library that facilitates unit testing of Hadoop MapReduce jobs by providing
drivers and mock objects that simulate the Hadoop runtime environment of a MapReduce job. The
code base already exists as a subproject of the Apache Hadoop TLP and lives in the "contrib"
directory of the source tree.
  
  == Background ==
- The ability to run services on cloud providers is very useful, particularly for proofs of
concept, testing, and also ad hoc production work. Bringing up clusters in the cloud is non-trivial,
since careful choreography is required. (Designing an interface that is convenient as well
as secure is also a challenge in a cloud context.)  Making services that runs on a variety
of cloud providers is harder, even with the availability of libraries like libcloud and jclouds,
since each platform's quirks and extra features must be considered (and either worked around,
or possibly taken advantage of, as appropriate) . Whirr will facilitate sharing of best practices,
both for a particular service (such as Hadoop configuration on a particular provider), and
for common cloud operations (such as installation of dependencies across cloud providers).
It will provide a space to share good configurations and will encode service-specific knowledge.
+ Writing unit tests for MapReduce jobs can be a tedious process. User code can quickly become
entangled with Hadoop APIs, making testing difficult and error-prone. In many cases, users
simply forgo testing given the complexity of the environment. MRUnit was created as a
simple library that users can combine with test frameworks like JUnit, providing a harness
for injecting appropriate mock objects.
  
  == Rationale ==
- There are already scripts in the Hadoop project that allow users to run Hadoop clusters
on Amazon EC2 and other cloud providers. While users have found these scripts useful, their
current home as a Hadoop Common contrib project has the following limitations:
-  * Tying the scripts' release cycle to Hadoop's means that it is difficult to distribute
updates to the scripts which are changing fast (new features and bugfixes).
-  * The scripts support multiple versions of Hadoop, so it makes more sense to distribute
them separately from Hadoop itself.
-  * They are general: people want to contribute code for non-Hadoop services like Cassandra
(for example: http://github.com/johanoskarsson/cassandra-ec2).
-  * Having a uniform approach to running services in the cloud, hosted in one project, makes
launching sets of complementary services easier for the user. Today, the scripts and libraries
hosted within each project (e.g. in Hadoop, HBase, Cassandra) have slightly different conventions
and semantics, and are likely to diverge over time. Building a community around cloud infrastructure
services will help enforce a common approach to running services in the cloud.
+ MRUnit has existed as a contrib component of Apache Hadoop. This has served to introduce
users to the library and to provide necessary functionality to developers in the form of development
support. That said, MRUnit is not necessarily an intrinsic component of Hadoop proper and
could benefit from being a standalone project in that:
+  * A separate project would support an independent development and release schedule, allowing
for faster iteration and response to user requests.
+  * Separating adjunct projects from the core Hadoop codebase simplifies Hadoop's build and
release.
+  * MRUnit users can obtain a simpler artifact through channels suited to development
(i.e. Maven or Ivy repositories).
+  * MRUnit can build out independent support for different versions of Hadoop without introducing
circular dependencies or testing issues.
+ 
+ Greater development and tooling support makes Hadoop accessible to a wider audience by
reducing the chance of bugs.
  
  == Initial Goals ==
-  * Provide a new home for the existing Hadoop cloud scripts.
+  * Provide a new home for the existing codebase.
+  * Make artifacts available via Maven and / or Ivy.
+  * Expand test support for other Hadoop components (e.g. Partitioners).
+  * Establish a lightweight, independent release cycle.
-  * Add more services (e.g. HBase)
-  * Develop Java libraries for Hadoop clusters
-  * Add new cloud providers by taking advantage of libcloud and jclouds.
-  * (Future) Run on own hardware, so users can take advantage of the same interface to control
services running locally or in the cloud.
  
  == Current Status ==
  === Meritocracy ===
- The Hadoop scripts were originally created by Tom White, and have had a substantial number
of contributions from members of the Hadoop community. By becoming its own project, significant
contributors to Whirr would become committers, and allow the project to grow.
+ MRUnit was originally created by Aaron Kimball and has had some contributions from members
of the Hadoop community. By becoming its own project, significant contributors to MRUnit would
become committers, allowing the project to grow.
  
  === Community ===
  The community interested in cloud service infrastructure is currently spread across many
smaller projects, and one of the main goals of this project is to build a vibrant community
to share best practices and build common infrastructure. For example, this project would provide
a home to facilitate collaboration between the groups of Hadoop and HBase developers who are
building cloud services.

