incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Trivial Update of "CrunchProposal" by JoshWills
Date Fri, 18 May 2012 20:33:42 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "CrunchProposal" page has been changed by JoshWills:
http://wiki.apache.org/incubator/CrunchProposal?action=diff&rev1=1&rev2=2

- = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
+ = Crunch - Easy, Efficient !MapReduce Pipelines in Java and Scala =
  
  == Abstract ==
  
- Crunch is a Java library for writing, testing, and running pipelines of MapReduce jobs on
Apache Hadoop.
+ Crunch is a Java library for writing, testing, and running pipelines of !MapReduce jobs
on Apache Hadoop.
  
  == Proposal ==
  
- Crunch is a Java library for writing, testing, and running pipelines of MapReduce jobs on
Apache Hadoop. Its main goal is to provide a high-level API for writing and testing complex
MapReduce jobs that require multiple processing stages.  It has a simple, flexible, and extensible
data model that makes it ideal for processing data that does not naturally fit into a relational
structure, such as time series and serialized object formats like JSON and Avro. It supports
running pipelines either as a series of MapReduce jobs on an Apache Hadoop cluster or in memory
on a single machine for fast testing and debugging.
+ Crunch is a Java library for writing, testing, and running pipelines of !MapReduce jobs
on Apache Hadoop. Its main goal is to provide a high-level API for writing and testing complex
!MapReduce jobs that require multiple processing stages.  It has a simple, flexible, and extensible
data model that makes it ideal for processing data that does not naturally fit into a relational
structure, such as time series and serialized object formats like JSON and Avro. It supports
running pipelines either as a series of !MapReduce jobs on an Apache Hadoop cluster or in
memory on a single machine for fast testing and debugging.
  
  == Background ==
  
- Crunch was initially developed by Cloudera to simplify the process of creating sequences
of dependent MapReduce jobs, especially jobs that processed non-relational data like time
series. Its design was based on a paper Google published about a Java library they developed
called FlumeJava that was created in order to solve a similar class of problems. Crunch was
open-sourced by Cloudera on GitHub as an Apache 2.0 licensed project in October 2011. During
this time Crunch has been formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
(February 2012), with an incremental update to version 0.2.1 (March 2012) .  These releases
are also distributed by Cloudera as source and binaries from Cloudera's Maven repository.
+ Crunch was initially developed by Cloudera to simplify the process of creating sequences
of dependent !MapReduce jobs, especially jobs that processed non-relational data like time
series. Its design is based on FlumeJava, a Java library that Google developed to solve a
similar class of problems and described in a published paper. Crunch was open-sourced by
Cloudera on GitHub as an Apache 2.0-licensed project in October 2011. Since then, Crunch
has been formally released twice, as versions 0.1.0 (October 2011) and 0.2.0 (February 2012),
with an incremental update to version 0.2.1 (March 2012). These releases
are also distributed by Cloudera as source and binaries from Cloudera's Maven repository.
  
  == Rationale ==
  
- Most of the interesting analytical and data processing tasks that are run on an Apache Hadoop
cluster require a series of MapReduce jobs to be executed in sequence. Developers who are
creating these pipelines today need to manually assign the sequence of tasks to perform in
a dependent chain of MapReduce jobs, even though there are a number of well-known patterns
for fusing dependent computations together into a single MapReduce stage and for performing
common types of joins and aggregations. This results in MapReduce pipelines that are more
difficult to test, maintain, and extend to support new functionality.
+ Most of the interesting analytical and data processing tasks that are run on an Apache Hadoop
cluster require a series of !MapReduce jobs to be executed in sequence. Developers creating
these pipelines today must manually sequence the tasks to perform as a dependent chain of
!MapReduce jobs, even though there are a number of well-known patterns
for fusing dependent computations together into a single !MapReduce stage and for performing
common types of joins and aggregations. This results in !MapReduce pipelines that are more
difficult to test, maintain, and extend to support new functionality.
  
  Furthermore, the type of data that is being stored and processed using Apache Hadoop is
evolving. Although Hadoop was originally used for storing large volumes of structured text
in the form of webpages and log files, it is now common for Hadoop to store complex, structured
data formats such as JSON, Apache Avro, and Apache Thrift. These formats allow developers
to work with serialized objects in programming languages like Java, C++, and Python, and allow
for new types of analysis to be performed on complex data types. Hadoop has also been adopted
by the scientific research community, who are using Hadoop to process time series data, structured
binary files in the HDF5 format, and large medical and satellite images.
  
- Crunch addresses these challenges by providing a lightweight and extensible Java API for
defining the stages of a data processing pipeline, which can then be run on an Apache Hadoop
cluster as a sequence of dependent MapReduce jobs, or in-memory on a single machine to facilitate
fast testing and debugging. Crunch relies on a small set of primitive abstractions that represent
immutable, distributed collections of objects. Developers define functions that are applied
to those objects in order to generate new immutable, distributed collections of objects. Crunch
also provides a library of common MapReduce patterns for performing efficient joins and aggregation
operations over these distributed collections that developers may integrate into their own
pipelines. Crunch also provides native support for processing structured binary data formats
like JSON, Apache Avro, and Apache Thrift, and is designed to be extensible to support working
with any kind of data format that Java supports in its native form.
+ Crunch addresses these challenges by providing a lightweight and extensible Java API for
defining the stages of a data processing pipeline, which can then be run on an Apache Hadoop
cluster as a sequence of dependent !MapReduce jobs, or in-memory on a single machine to facilitate
fast testing and debugging. Crunch relies on a small set of primitive abstractions that represent
immutable, distributed collections of objects. Developers define functions that are applied
to those objects in order to generate new immutable, distributed collections of objects. Crunch
also provides a library of common !MapReduce patterns for performing efficient joins and aggregation
operations over these distributed collections that developers may integrate into their own
pipelines. Crunch also provides native support for processing structured binary data formats
like JSON, Apache Avro, and Apache Thrift, and is designed to be extensible to support working
with any kind of data format that Java supports in its native form.
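The programming model described above, immutable collections transformed by user-defined functions into new immutable collections, can be sketched with a toy in-memory Java analogue. This is an illustrative sketch only, not the actual Crunch API: the class `MemCollection` and its `parallelDo` method are invented names standing in for Crunch's distributed abstractions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Toy in-memory analogue of an immutable, distributed collection.
// The real Crunch abstractions run on Hadoop; this sketch only
// illustrates the "transform to produce a new collection" model.
final class MemCollection<T> {
    private final List<T> elements;

    MemCollection(List<T> elements) {
        this.elements = Collections.unmodifiableList(new ArrayList<>(elements));
    }

    // Apply a function to each element, emitting zero or more outputs,
    // and return a NEW immutable collection; the input is never mutated.
    <U> MemCollection<U> parallelDo(Function<T, List<U>> fn) {
        List<U> out = new ArrayList<>();
        for (T t : elements) {
            out.addAll(fn.apply(t));
        }
        return new MemCollection<>(out);
    }

    List<T> materialize() {
        return elements;
    }

    public static void main(String[] args) {
        MemCollection<String> lines =
            new MemCollection<>(Arrays.asList("a b", "c"));
        // Split lines into words; the original collection is unchanged.
        MemCollection<String> words =
            lines.parallelDo(line -> Arrays.asList(line.split(" ")));
        System.out.println(words.materialize()); // prints [a, b, c]
    }
}
```

Because each transform yields a new immutable collection, a planner is free to analyze the whole chain of transforms before running anything, which is what lets Crunch compile a pipeline down to a small number of !MapReduce jobs.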
  
  == Initial Goals ==
  
@@ -29, +29 @@

  Some goals include:
   * To stand up a sustaining Apache-based community around the Crunch codebase.
   * Improved documentation of Java libraries and best practices.
-  * Support the ability to "fuse" logically independent pipeline stages that aggregate the
same data in different ways into a single MapReduce job.
+  * Support the ability to "fuse" logically independent pipeline stages that aggregate the
same data in different ways into a single !MapReduce job.
   * Performance, usability, and robustness improvements.
-  * Improving diagnostic reporting and debugging for individual MapReduce jobs.
+  * Improving diagnostic reporting and debugging for individual !MapReduce jobs.
   * Providing a centralized place for contributed extensions and domain-specific applications.
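The "fuse" goal above can be illustrated in miniature: instead of scanning the input once per aggregate, a single pass feeds several logically independent aggregations. This is a hedged sketch in plain Java of the general idea only; in Crunch the fusion would be performed by the planner across !MapReduce stages, not hand-written like this.

```java
import java.util.Arrays;
import java.util.List;

// Two logically independent aggregates (sum and max) over the same data,
// computed in ONE pass instead of one scan per aggregate -- the same idea
// as fusing independent pipeline stages into a single job.
final class FusedAggregates {
    static long[] sumAndMax(List<Long> values) {
        long sum = 0;
        long max = Long.MIN_VALUE;
        for (long v : values) { // a single scan feeds both aggregations
            sum += v;
            if (v > max) {
                max = v;
            }
        }
        return new long[] {sum, max};
    }

    public static void main(String[] args) {
        long[] r = sumAndMax(Arrays.asList(3L, 9L, 4L));
        System.out.println(r[0] + " " + r[1]); // prints 16 9
    }
}
```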
  
  = Current Status =
@@ -52, +52 @@

   * Brock Noland: Wrote many of the test cases, user documentation, and contributed several
bug fixes.
   * Josh Wills: Josh wrote much of the original Crunch code.
   * Gabriel Reid: Gabriel significantly improved Crunch's handling of Avro data and has contributed
several bug fixes for the core planner.
-  * Tom White: Tom added several libraries for common MapReduce pipeline operations, including
the sort library and a library of set operations.
+  * Tom White: Tom added several libraries for common !MapReduce pipeline operations, including
the sort library and a library of set operations.
   * Christian Tzolov: Christian has contributed several bug fixes for the Avro serialization
module and the unit testing framework.
   * Robert Chu: Robert did the left/right/outer join implementations for Crunch and fixed
several bugs in the runtime configuration logic.
  
@@ -60, +60 @@

  
  == Alignment ==
  
- Crunch complements several current Apache projects. It complements Hadoop MapReduce by providing
a higher-level API for developing complex data processing pipelines that require a sequence
of MapReduce jobs to perform. Crunch also supports Apache HBase in order to simplify the process
of writing MapReduce jobs that execute over HBase tables. Crunch makes extensive use of the
Apache Avro data format as an internal data representation process that makes MapReduce jobs
execute quickly and efficiently.
+ Crunch complements several current Apache projects. It complements Hadoop !MapReduce by
providing a higher-level API for developing complex data processing pipelines that require
a sequence of !MapReduce jobs. Crunch also supports Apache HBase, simplifying the process
of writing !MapReduce jobs that execute over HBase tables. Crunch makes extensive use of
the Apache Avro data format as its internal data representation, which helps !MapReduce
jobs execute quickly and efficiently.
  
  = Known Risks =
  
@@ -84, +84 @@

  
  Crunch depends upon other Apache Projects: Apache Hadoop, Apache HBase, Apache Log4J, Apache
Thrift, Apache Avro, and multiple Apache Commons components. Its build depends upon Apache
Maven.
  
- Crunch's functionality has some indirect or direct overlap with the functionality of Apache
Pig and Apache Hive but has several significant differences in terms of their user community
and the types of data they are designed to work with.  Both Hive and Pig are high-level languages
that are designed to allow non-programmers to quickly create and run MapReduce jobs. Crunch
is a Java library whose primary community is Java developers who are creating scalable data
pipelines and MapReduce-based applications. Additionally, Hive and Pig both employ a relational,
tuple-oriented data model on top of HDFS, which introduces overhead and limits expressive
power for developers who are working with serialized objects and non-relational data types.
Crunch uses a lower-level data model that gives developers the freedom to work with data in
a format that is optimized for the problem they are trying to solve.
+ Crunch's functionality overlaps, directly and indirectly, with that of Apache Pig and
Apache Hive, but it differs significantly in its target user community and the types of
data it is designed to work with. Both Hive and Pig are high-level languages
that are designed to allow non-programmers to quickly create and run !MapReduce jobs. Crunch
is a Java library whose primary community is Java developers who are creating scalable data
pipelines and !MapReduce-based applications. Additionally, Hive and Pig both employ a relational,
tuple-oriented data model on top of HDFS, which introduces overhead and limits expressive
power for developers who are working with serialized objects and non-relational data types.
Crunch uses a lower-level data model that gives developers the freedom to work with data in
a format that is optimized for the problem they are trying to solve.
  
  == An Excessive Fascination with the Apache Brand ==
  

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org

