Subject: svn commit: r887983 - in /websites/staging/crunch/trunk/content: ./ intro.html
Date: Mon, 25 Nov 2013 05:35:13 -0000
To: commits@crunch.apache.org
From: buildbot@apache.org
Reply-To: dev@crunch.apache.org

Author: buildbot
Date: Mon Nov 25 05:35:13 2013
New Revision: 887983

Log: Staging update by buildbot for crunch

Modified:
    websites/staging/crunch/trunk/content/   (props changed)
    websites/staging/crunch/trunk/content/intro.html
Propchange: websites/staging/crunch/trunk/content/
    cms:source-revision: 1545153 -> 1545156

Modified: websites/staging/crunch/trunk/content/intro.html

Reading data from the pipeline into the client and making decisions based on that data allows us to create sophisticated analytical applications that can modify their downstream processing based on the results of upstream computations.

Data Model and Operators

The Java API is centered around three interfaces that represent distributed datasets: PCollection, PTable, and PGroupedTable.

A PCollection<T> represents a distributed, unordered collection of elements of type T. For example, we represent a text file as a PCollection<String> object. PCollection<T> provides a method, parallelDo, that applies a DoFn to each element in the PCollection<T> in parallel, and returns a new PCollection<U> as its result.

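As a sketch of how parallelDo is used in practice (assuming the crunch-core artifact is on the classpath; the input strings and class name are illustrative), a DoFn that splits lines into words could look like this:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

public class ParallelDoExample {
  public static void main(String[] args) {
    // An in-memory PCollection<String> of lines; a real pipeline would
    // typically read these from a file on HDFS instead.
    PCollection<String> lines = MemPipeline.typedCollectionOf(
        Writables.strings(), "hello world", "hello crunch");

    // parallelDo applies the DoFn to every element in parallel; a DoFn
    // may emit zero or more outputs per input, and the PType argument
    // describes the element type of the resulting PCollection.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    System.out.println(words.materialize());
  }
}
```

Because the DoFn emits one output per token, a single input line can produce several output elements, which is why parallelDo returns a new PCollection rather than mapping strictly one-to-one.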
A PTable<K, V> is a sub-interface of PCollection<Pair<K, V>> that represents a distributed, unordered multimap of its key type K to its value type V. In addition to the parallelDo operation, PTable provides a groupByKey operation that aggregates all of the values in the PTable that have the same key into a single record. It is the groupByKey operation that triggers the sort phase of a MapReduce job. Developers can exercise fine-grained control over the number of reducers and the partitioning, grouping, and sorting strategies used during the shuffle by providing an instance of the GroupingOptions class to the groupByKey function.

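A sketch of that fine-grained shuffle control (hedged: the reducersFor heuristic and all names here are illustrative, not part of the Crunch API):

```java
import org.apache.crunch.GroupingOptions;
import org.apache.crunch.PGroupedTable;
import org.apache.crunch.PTable;

public class GroupByKeyExample {
  // Illustrative heuristic: roughly one reducer per GB of input,
  // never fewer than one.
  static int reducersFor(long inputBytes) {
    return (int) Math.max(1, inputBytes / (1L << 30));
  }

  static PGroupedTable<String, Long> group(PTable<String, Long> counts,
                                           long inputBytes) {
    // GroupingOptions also exposes setters for the partitioner and the
    // grouping/sorting comparators used during the shuffle.
    GroupingOptions options = GroupingOptions.builder()
        .numReducers(reducersFor(inputBytes))
        .build();
    return counts.groupByKey(options);
  }
}
```

Calling groupByKey() with no arguments instead lets the planner pick defaults for all of these settings.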
The result of a groupByKey operation is a PGroupedTable<K, V> object, which is a distributed, sorted map of keys of type K to an Iterable that may be iterated over exactly once. In addition to parallelDo processing via DoFns, PGroupedTable provides a combineValues operation that allows a commutative and associative Aggregator to be applied to the values of the PGroupedTable instance on both the map and reduce sides of the shuffle. A number of common Aggregator<V> implementations are provided in the Aggregators class.

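For example, summing the values for each key can be sketched as follows (the method and variable names are illustrative):

```java
import org.apache.crunch.PGroupedTable;
import org.apache.crunch.PTable;
import org.apache.crunch.fn.Aggregators;

public class CombineValuesExample {
  // Apply a commutative, associative sum to each key's values; the
  // planner may run this Aggregator on the map side (as a combiner)
  // as well as on the reduce side of the shuffle.
  static PTable<String, Long> sumByKey(PGroupedTable<String, Long> grouped) {
    return grouped.combineValues(Aggregators.SUM_LONGS());
  }
}
```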
Finally, PCollection, PTable, and PGroupedTable all support a union operation, which takes a series of distinct PCollections that all have the same data type and treats them as a single virtual PCollection.

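A minimal union sketch (the log paths are hypothetical):

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.io.From;

public class UnionExample {
  // Treat two text inputs of the same element type as a single
  // virtual PCollection.
  static PCollection<String> allLogs(Pipeline pipeline) {
    PCollection<String> current = pipeline.read(From.textFile("/logs/current"));
    PCollection<String> archived = pipeline.read(From.textFile("/logs/archived"));
    return current.union(archived);
  }
}
```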
All of the other data transformation operations supported by the Crunch APIs (aggregations, joins, sorts, secondary sorts, and cogrouping) are implemented in terms of these four primitives. The patterns themselves are defined in the org.apache.crunch.lib package and its children, and a few of the most common patterns have convenience functions defined on the PCollection and PTable interfaces.

Writing DoFns

DoFns represent the logical computations of your Crunch pipelines. They are designed to be easy to write, easy to test, and easy to deploy.

Crunch provides a number of helper methods for working with Hadoop Counters, all named increment. Counters are an incredibly useful way of keeping track of the state of long-running data pipelines and detecting any exceptional conditions that occur during processing, and they are supported in both the MapReduce-based and in-memory Crunch pipeline contexts. You can retrieve the value of the Counters in your client code at the end of a MapReduce pipeline by getting them from the StageResult objects returned by Crunch at the end of a run.

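A sketch of the increment helpers inside a DoFn (the group and counter names are made up for illustration):

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

public class ParseFn extends DoFn<String, String> {
  @Override
  public void process(String record, Emitter<String> emitter) {
    if (record == null || record.isEmpty()) {
      // Counters surface exceptional conditions without failing the job.
      increment("parse", "EMPTY_RECORD");
      return;
    }
    increment("parse", "GOOD_RECORD");
    emitter.emit(record.trim());
  }
}
```

After the run, the accumulated counts can be read back from the StageResult objects in the client.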
(Note that there was a change in the Counters API from Hadoop 1.0 to Hadoop 2.0, and thus we do not recommend that you work with the Counter classes directly in your Crunch pipelines; the two getCounter methods that were defined in DoFn are both deprecated.)

Common DoFn Patterns

The Crunch APIs contain a number of useful subclasses of DoFn that handle common data processing scenarios and are easier to write and test. The top-level org.apache.crunch package contains three of the most important specializations, which we will discuss now. Each of these specialized DoFn implementations has associated methods on the PCollection, PTable, and PGroupedTable interfaces to support common data processing steps.

The simplest extension is the FilterFn class, which defines a single abstract method, boolean accept(T input). The FilterFn can be applied to a PCollection<T> by calling the filter(FilterFn<T> fn) method, and will return a new PCollection<T> that only contains the elements of the input PCollection for which the accept method returned true. Note that the filter function does not include a PType argument in its signature, because there is no change in the data type of the PCollection when the FilterFn is applied. It is possible to compose new FilterFn instances by combining multiple FilterFns together using the and, or, and not factory methods defined in the FilterFns helper class.

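As a sketch (NonEmpty and NoComments are illustrative names, not part of Crunch):

```java
import org.apache.crunch.FilterFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.fn.FilterFns;

public class FilterExample {
  static class NonEmpty extends FilterFn<String> {
    @Override
    public boolean accept(String input) {
      return input != null && !input.isEmpty();
    }
  }

  static class NoComments extends FilterFn<String> {
    @Override
    public boolean accept(String input) {
      return !input.startsWith("#");
    }
  }

  // filter() needs no PType argument: the element type is unchanged.
  static PCollection<String> clean(PCollection<String> lines) {
    return lines.filter(FilterFns.and(new NonEmpty(), new NoComments()));
  }
}
```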
The second extension is the MapFn class, which defines a single abstract method, T map(S input). For simple transform tasks in which every input record will have exactly one output, it's easy to test a MapFn by verifying that a given input returns a given output.

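For instance (UpperCaseFn is an illustrative name):

```java
import org.apache.crunch.MapFn;

// A one-in/one-out transform. Because map() is a pure function, it can
// be unit-tested directly, without constructing a pipeline.
public class UpperCaseFn extends MapFn<String, String> {
  @Override
  public String map(String input) {
    return input.toUpperCase();
  }
}
// e.g. new UpperCaseFn().map("crunch") returns "CRUNCH"
```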
MapFns are also used in specialized methods on the PCollection and PTable interfaces. PCollection<V> defines a method that converts the collection into a PTable<K, V> by applying a MapFn function that extracts the key (of type K) from each value; only the MapFn that extracts the key need be given, and a PTableType<K, V> is constructed from the given key type and the PCollection's existing value type. PTable<K, V>, in turn, has methods PTable<K1, V> mapKeys(MapFn<K, K1> mapFn) and PTable<K, V2> mapValues(MapFn<V, V2>) that handle the common case of converting just one of the paired values in a PTable instance from one type to another while leaving the other type the same.

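A sketch of mapValues, assuming the Writable type family (class and method names are illustrative):

```java
import org.apache.crunch.MapFn;
import org.apache.crunch.PTable;
import org.apache.crunch.types.writable.Writables;

public class MapValuesExample {
  // Convert just the value side of a PTable<String, Long> to String,
  // leaving the key type unchanged; the PType argument describes the
  // new value type.
  static PTable<String, String> format(PTable<String, Long> counts) {
    return counts.mapValues(new MapFn<Long, String>() {
      @Override
      public String map(Long count) {
        return Long.toString(count);
      }
    }, Writables.strings());
  }
}
```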
The final top-level extension to DoFn is the CombineFn class, which is used in conjunction with the combineValues method defined on the PGroupedTable interface. CombineFns are used to represent the associative operations that can be applied using the MapReduce Combiner concept in order to reduce the amount of data that is shipped over the network during a shuffle.

The CombineFn extension is different from the FilterFn and MapFn classes in that it does not define an abstract method for handling data beyond the default process method that any other DoFn would use; rather, extending the CombineFn class signals to the Crunch planner that the logic contained in this class satisfies the conditions required for use with the MapReduce combiner.

Crunch supports many types of these associative patterns, such as sums, counts, and set unions, via the Aggregator interface, which is defined right alongside the CombineFn class in the top-level org.apache.crunch package. There are a number of implementations of the Aggregator interface defined via static factory methods in the Aggregators class.

Serializing Data with PTypes

Why PTypes Are Necessary, the two type families, the core methods and tuples.

Extending PTypes

The simplest way to create a new PType<T> for a data object is to create a derived PType from one of the built-in PTypes for the Avro and Writable type families. If we have a base PType<S>, we can create a derived PType<T> by implementing an input MapFn<S, T> and an output MapFn<T, S> and then calling PTypeFamily.derived(Class<T>, MapFn<S, T> in, MapFn<T, S> out, PType<S> base), which will return a new PType<T>. There are examples of derived PTypes in the PTypes class, including serialization support for protocol buffers, Thrift records, Java Enums, BigInteger, and UUIDs.

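For example, a derived PType for java.util.UUID, built from the Writable family's string PType via the derived factory method, might be sketched as:

```java
import java.util.UUID;

import org.apache.crunch.MapFn;
import org.apache.crunch.types.PType;
import org.apache.crunch.types.writable.Writables;

public class UuidPType {
  public static final PType<UUID> UUIDS = Writables.derived(
      UUID.class,
      new MapFn<String, UUID>() {   // input MapFn: base type -> derived
        @Override
        public UUID map(String s) {
          return UUID.fromString(s);
        }
      },
      new MapFn<UUID, String>() {   // output MapFn: derived -> base type
        @Override
        public String map(UUID u) {
          return u.toString();
        }
      },
      Writables.strings());
}
```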
Reading and Writing Data: Sources, Targets, and SourceTargets

MapReduce developers are familiar with the InputFormat<K, V> and OutputFormat<K, V> classes for reading and writing data during MapReduce jobs. Crunch sources that can be read directly declare an Iterable<T> read(...) method that allows their contents to be read either into the client or into a DoFn implementation that can use the data read from the source to perform additional transforms on the main input data that is processed using the DoFn's process method (this is how Crunch supports mapside-join operations).

Support for the most common Source, Target, and SourceTarget implementations is provided by the factory functions declared in the From (Sources), To (Targets), and At (SourceTargets) classes in the org.apache.crunch.io package.

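A minimal read/write sketch using these factory classes (the paths are hypothetical):

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;

public class ReadWriteExample {
  public static void main(String[] args) {
    // MRPipeline plans and runs the work as MapReduce jobs.
    Pipeline pipeline = new MRPipeline(ReadWriteExample.class);
    // From.textFile builds a text Source; To.textFile builds a Target.
    PCollection<String> lines = pipeline.read(From.textFile("/data/input"));
    lines.write(To.textFile("/data/output"));
    pipeline.done();
  }
}
```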
Pipeline Building and Execution

Creating A New Crunch Pipeline