crunch-commits mailing list archives

From jwi...@apache.org
Subject svn commit: r1545153 - in /crunch/site/trunk/content: apidocs/current intro.mdtext
Date Mon, 25 Nov 2013 05:20:31 GMT
Author: jwills
Date: Mon Nov 25 05:20:31 2013
New Revision: 1545153

URL: http://svn.apache.org/r1545153
Log:
Add intro content fixes and current apidocs link

Added:
    crunch/site/trunk/content/apidocs/current   (with props)
Modified:
    crunch/site/trunk/content/intro.mdtext

Added: crunch/site/trunk/content/apidocs/current
URL: http://svn.apache.org/viewvc/crunch/site/trunk/content/apidocs/current?rev=1545153&view=auto
==============================================================================
--- crunch/site/trunk/content/apidocs/current (added)
+++ crunch/site/trunk/content/apidocs/current Mon Nov 25 05:20:31 2013
@@ -0,0 +1 @@
+link 0.8.0
\ No newline at end of file

Propchange: crunch/site/trunk/content/apidocs/current
------------------------------------------------------------------------------
    svn:special = *

Modified: crunch/site/trunk/content/intro.mdtext
URL: http://svn.apache.org/viewvc/crunch/site/trunk/content/intro.mdtext?rev=1545153&r1=1545152&r2=1545153&view=diff
==============================================================================
--- crunch/site/trunk/content/intro.mdtext (original)
+++ crunch/site/trunk/content/intro.mdtext Mon Nov 25 05:20:31 2013
@@ -139,7 +139,7 @@ supports two serialization frameworks, c
 You can read more about how to work with Crunch's serialization libraries here. TODO
 
 Because all of the core logic in our application is exposed via a single static method that
operates on Crunch interfaces, we can use Crunch's
-in-memory API to test our business logic using a unit testing framework like JUnit. Let's
look at an exampel unit test for the word count
+in-memory API to test our business logic using a unit testing framework like JUnit. Let's
look at an example unit test for the word count
 application:
 
     package org.myorg;
@@ -173,28 +173,31 @@ applications that can modify their downs
 
 ### Data Model and Operators
 
-The Java API is centered around three interfaces that represent distributed datasets: `PCollection<T>`,
`PTable<K, V>`, and `PGroupedTable<K, V>`.
+The Java API is centered around three interfaces that represent distributed datasets: [PCollection<T>](apidocs/current/org/apache/crunch/PCollection.html),
+[PTable<K, V>](apidocs/current/org/apache/crunch/PTable.html),
and [PGroupedTable<K, V>](apidocs/current/org/apache/crunch/PGroupedTable.html).
 
 A `PCollection<T>` represents a distributed, unordered collection of elements of type
T. For example, we represent a text file as a
-`PCollection<String>` object. PCollection provides a method, `parallelDo`, that applies
a `DoFn` to each element in a PCollection in parallel,
-and returns a new PCollection as its result. 
+`PCollection<String>` object. `PCollection<T>` provides a method, `parallelDo`,
that applies a [DoFn<T, U>](apidocs/current/org/apache/crunch/DoFn.html)
+to each element in the `PCollection<T>` in parallel, and returns a new `PCollection<U>`
as its result.
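As a rough sketch of the `parallelDo` pattern just described (assuming the Writable type family for serialization; the `SplitLinesFn` class name and the word-splitting logic are illustrative, not from the Crunch codebase):

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.types.writable.Writables;

// Illustrative DoFn: splits each input line into whitespace-separated words.
public class SplitLinesFn extends DoFn<String, String> {
  @Override
  public void process(String line, Emitter<String> emitter) {
    for (String word : line.split("\\s+")) {
      emitter.emit(word);  // each input line may emit zero or more outputs
    }
  }
}

// Applying it to a PCollection<String> of lines:
// PCollection<String> words = lines.parallelDo(new SplitLinesFn(), Writables.strings());
```

The second argument to `parallelDo` is the PType that tells Crunch how to serialize elements of the resulting collection.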
 
-A `PTable<K, V>` is a sub-interface of PCollection that represents a distributed, unordered
multimap of its key type K to its value type V.
+A `PTable<K, V>` is a sub-interface of `PCollection<Pair<K, V>>` that represents
a distributed, unordered multimap of its key type K to its value type V.
 In addition to the parallelDo operation, PTable provides a `groupByKey` operation that aggregates
all of the values in the PTable that
-have the same key into a single record. It is the groupByKey operation that triggers the
sort phase of a MapReduce job.
-
-The result of a groupByKey operation is a `PGroupedTable<K, V>` object, which is a
distributed, sorted map of keys of type K to an Iterable
-collection of values of type V. In addition to parallelDo, the PGroupedTable provides a `combineValues`
operation, which allows for
-a commutative and associative aggregation operator to be applied to the values of the PGroupedTable
instance on both the map side and the
-reduce side of a MapReduce job.
+have the same key into a single record. It is the groupByKey operation that triggers the
sort phase of a MapReduce job. Developers can exercise
+fine-grained control over the number of reducers and the partitioning, grouping, and sorting
strategies used during the shuffle by providing an instance
+of the [GroupingOptions](apidocs/current/org/apache/crunch/GroupingOptions.html) class to
the `groupByKey` function.
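A minimal sketch of passing GroupingOptions to `groupByKey` (the class name and reducer count here are illustrative; this assumes a `PTable<String, Long>` of word counts named `counts`):

```java
import org.apache.crunch.GroupingOptions;

// Illustrative: fix the number of reducers for one shuffle. Partitioner,
// grouping, and sort comparator classes can be set via the same builder.
public class GroupingExample {
  public static GroupingOptions tenReducers() {
    return GroupingOptions.builder()
        .numReducers(10)  // explicit reducer count for this groupByKey
        .build();
  }
  // PGroupedTable<String, Long> grouped = counts.groupByKey(tenReducers());
}
```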
+
+The result of a groupByKey operation is a `PGroupedTable<K, V>` object, which is a
distributed, sorted map of keys of type K to an `Iterable<V>` that may
+be iterated over exactly once. In addition to `parallelDo` processing via DoFns, PGroupedTable
provides a `combineValues` operation that allows a
+commutative and associative [Aggregator<V>](apidocs/current/org/apache/crunch/Aggregator.html)
to be applied to the values of the PGroupedTable
+instance on both the map and reduce sides of the shuffle. A number of common `Aggregator<V>`
implementations are provided in the
+[Aggregators](apidocs/current/org/apache/crunch/fn/Aggregators.html) class.
 
 Finally, PCollection, PTable, and PGroupedTable all support a `union` operation, which takes
a series of distinct PCollections that all have
 the same data type and treats them as a single virtual PCollection.
 
-All of the other MapReduce patterns supported by the Crunch APIs (aggregations, joins, sorts,
secondary sorts, and cogrouping) are all implemented
-in terms of these four primitives. The patterns themselves are defined in the `org.apache.crunch.lib`
package and its children, and a few of
-the most common patterns have convenience functions defined on the PCollection and PTable
interfaces. We will do a more detailed review of these
-patterns later in this document, but here are a few examples to get you started: TODO
+All of the other data transformation operations supported by the Crunch APIs (aggregations,
joins, sorts, secondary sorts, and cogrouping) are implemented
+in terms of these four primitives. The patterns themselves are defined in the [org.apache.crunch.lib](apidocs/current/org/apache/crunch/lib/package-summary.html)
+package and its children, and a few of the most common patterns have convenience functions
defined on the PCollection and PTable interfaces.
 
 ### Writing DoFns
 
@@ -205,10 +208,9 @@ how to use them effectively is critical 
 #### DoFn extends Serializable
 
 The most important thing to remember about DoFns is that they all implement the `java.io.Serializable`
interface, which means that all of the
-state information associated with a DoFn must also be serializable. There is an excellent
overview of Java serializability here that is worth
-reviewing if you aren't familiar with Java's serializability model. TODO
+state information associated with a DoFn must also be serializable. There is an [excellent
overview of Java serializability](http://docs.oracle.com/javase/tutorial/jndi/objects/serial.html)
that is worth reviewing if you aren't familiar with it already.
 
-If your DoFn needs to work with a class that does not implement Serializable and cannot be
modified (e.g., because it is defined in a third-party
+If your DoFn needs to work with a class that does not implement Serializable and cannot be
modified (for example, because it is defined in a third-party
 library), you should use the `transient` keyword on that member variable so that serializing
the DoFn won't fail if that object happens to be
 defined. You can create an instance of the object during runtime using the `initialize` method
described in the following section.
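The transient-plus-initialize pattern can be sketched as follows; `TsvParser` here is a hypothetical stand-in for a third-party class that does not implement Serializable:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

// Illustrative DoFn holding a non-serializable helper in a transient field.
public class ParseFn extends DoFn<String, String> {

  // Hypothetical non-serializable third-party class, for illustration only.
  static class TsvParser {
    String firstField(String line) { return line.split("\t", 2)[0]; }
  }

  private transient TsvParser parser;

  @Override
  public void initialize() {
    parser = new TsvParser();  // created at runtime, never serialized with the DoFn
  }

  @Override
  public void process(String line, Emitter<String> emitter) {
    emitter.emit(parser.firstField(line));
  }
}
```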
 
@@ -217,22 +219,22 @@ defined. You can create an instance of t
 After the Crunch runtime loads the serialized DoFns into its map and reduce tasks, the DoFns
are executed on the input data via the following
 sequence:
 
-# First, the DoFn is given access to the `TaskInputOutputContext` implementation for the
current task. This allows the DoFn to access any
+1. First, the DoFn is given access to the `TaskInputOutputContext` implementation for the
current task. This allows the DoFn to access any
 necessary configuration and runtime information needed before or during processing.
-# Next, the DoFn's `initialize` method is called. The initialize method is similar to the
`setup` method used in the Mapper and Reducer classes;
+2. Next, the DoFn's `initialize` method is called. The initialize method is similar to the
`setup` method used in the Mapper and Reducer classes;
 it is called before processing begins in order to enable any necessary initialization or
configuration of the DoFn to be performed. For example,
 if we were making use of a non-serializable third-party library, we would create an instance
of it here.
-# At this point, data processing begins. The map or reduce task will begin passing records
in to the DoFn's `process` method, and capturing the
+3. At this point, data processing begins. The map or reduce task will begin passing records
in to the DoFn's `process` method, and capturing the
 output of the process method into an `Emitter<T>` that can either pass the data along
to another DoFn for processing or serialize it as the output
 of the current processing stage.
-# Finally, after all of the records have been processed, the `void cleanup(Emitter<T>
emitter)` method is called on each DoFn. The cleanup method
+4. Finally, after all of the records have been processed, the `void cleanup(Emitter<T>
emitter)` method is called on each DoFn. The cleanup method
 has a dual purpose: it can be used to emit any state information that the DoFn wants to pass
along to the next stage (for example, cleanup could
 be used to emit the sum of a list of numbers that was passed in to the DoFn's process method),
as well as to release any resources or perform any
 other cleanup task that is appropriate once the job has finished executing.
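The four-step lifecycle above can be sketched with the sum example the text mentions (the `SumFn` class name is illustrative): initialize resets per-task state, process accumulates without emitting, and cleanup emits the running total to the next stage.

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

// Illustrative DoFn that sums its inputs and emits the total in cleanup.
public class SumFn extends DoFn<Integer, Long> {
  private transient long sum;

  @Override
  public void initialize() {
    sum = 0L;  // per-task setup, analogous to Mapper/Reducer setup()
  }

  @Override
  public void process(Integer input, Emitter<Long> emitter) {
    sum += input;  // accumulate only; nothing emitted per record
  }

  @Override
  public void cleanup(Emitter<Long> emitter) {
    emitter.emit(sum);  // pass the accumulated state along
  }
}
```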
 
 #### Accessing Runtime MapReduce APIs
 
-DoFns provide direct access to the `TaskInputOutputContext` object that is used within a
given Map or Reduce task via the protected `getContext`
+DoFns provide direct access to the `TaskInputOutputContext` object that is used within a
given Map or Reduce task via the `getContext`
 method. There are also a number of helper methods for working with the objects associated
with the TaskInputOutputContext, including:
 
 * `getConfiguration()` for accessing the `Configuration` object that contains much of the
detail about system and user-specific parameters for a
@@ -242,19 +244,23 @@ framework won't kill it,
 * `setStatus(String status)` and `getStatus` for setting task status information, and
 * `getTaskAttemptID()` for accessing the current `TaskAttemptID` information.
 
-Crunch provides a number of helper methods, all named `increment` and having various signatures,
for working with Hadoop Counters.
-There was a change in the Counters API from Hadoop 1.0 to Hadoop 2.0, and thus we do not
recommend that you work with the `Counter` classes
-directly in your Crunch pipelines (the two `getCounter` methods that were defined in DoFn
are both deprecated) so that you will not be
-required to recompile your job jars when you move from a Hadoop 1.x cluster to a Hadoop 2.x
cluster.
+Crunch provides a number of helper methods for working with [Hadoop Counters](http://codingwiththomas.blogspot.com/2011/04/controlling-hadoop-job-recursion.html),
all named `increment`. Counters are an incredibly useful way of keeping track of the state
of long running data pipelines and detecting any exceptional conditions that
+occur during processing, and they are supported in both the MapReduce-based and in-memory
Crunch pipeline contexts. You can retrieve the value of the Counters
+in your client code at the end of a MapReduce pipeline by getting them from the [StageResult](apidocs/current/org/apache/crunch/PipelineResult.StageResult.html)
+objects returned by Crunch at the end of a run.
+
+(Note that there was a change in the Counters API from Hadoop 1.0 to Hadoop 2.0, and thus
we do not recommend that you work with the
+Counter classes directly in your Crunch pipelines (the two `getCounter` methods that were
defined in DoFn are both deprecated) so that you will not be
+required to recompile your job jars when you move from a Hadoop 1.0 cluster to a Hadoop 2.0
cluster.)
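A sketch of the `increment` pattern described above; the `ValidateFn` class and the group and counter names are illustrative, not a Crunch convention:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

// Illustrative DoFn that counts malformed records via a Hadoop Counter.
public class ValidateFn extends DoFn<String, String> {
  static boolean isValid(String record) {
    return record != null && !record.trim().isEmpty();
  }

  @Override
  public void process(String record, Emitter<String> emitter) {
    if (isValid(record)) {
      emitter.emit(record);
    } else {
      increment("Validation", "MALFORMED_RECORDS");  // no exception, just count
    }
  }
}

// Client side, after the pipeline runs, counter values are available from the
// StageResult objects (a sketch; see the linked apidocs for the exact API):
//   PipelineResult result = pipeline.done();
//   for (PipelineResult.StageResult stage : result.getStageResults()) {
//     stage.getCounters();  // Hadoop Counters for that MapReduce stage
//   }
```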
 
 #### Configuring the Crunch Planner and MapReduce Jobs with DoFns
 
 Although most of the DoFn methods are focused on runtime execution, there are a handful of
methods that are used during the planning phase
 before a pipeline is converted into MapReduce jobs. The first of these functions is `float
scaleFactor()`, which should return a floating point
 value greater than 0.0f. You can override the scaleFactor method in your custom DoFns in
order to provide a hint to the Crunch planner about
-how much larger (or smaller) an input data set will become after passing through the process
method. If the groupByKey method is called without
+how much larger (or smaller) an input data set will become after passing through the process
method. If the `groupByKey` method is called without
 an explicit number of reducers provided, the planner will try to guess how many reduce tasks
should be used for the job based on the size of
-the input data, which is determined in part by using the scaleFactor results.
+the input data, which is determined in part by using the result of calling the `scaleFactor`
method on the DoFns in the processing path.
 
 Sometimes, you may know that one of your DoFns has some unusual parameter settings that need
to be specified on any job that includes that
 DoFn as part of its processing. A DoFn can modify the Hadoop Configuration object that is
associated with the MapReduce job it is assigned to
@@ -262,22 +268,23 @@ on the client before processing begins b
 will require extra memory settings to run, and so you could make sure that the value of the
`mapred.child.java.opts` argument had a large enough
 memory setting for the DoFn's needs before the job was launched on the cluster.
 
-#### DoFn Extensions and Helper Classes
+#### Common DoFn Patterns
 
 The Crunch APIs contain a number of useful subclasses of DoFn that handle common data processing
scenarios and are easier
-to write and test. The top-level `org.apache.crunch` package contains three of the most important
specializations, which we will
-discuss now. Each of these specialized DoFn implementations has associated methods on the
PCollection, PTable, and PGroupedTable
-interfaces to support these common data processing tasks.
-
-The simplest extension is the `FilterFn<T>` class, which defines a single abstract
method, `boolean accept(T input)`. The FilterFn can be applied
-to a `PCollection<T>` by calling the `filter(FilterFn<T> fn)` method, and will
return a new `PCollection<T>` that only contains the elements
-of the input PCollection for which the accept method returned true. Note that the filter
function does not include a PType argument in its
+to write and test. The top-level [org.apache.crunch](apidocs/current/org/apache/crunch/package-summary.html)
package contains three
+of the most important specializations, which we will discuss now. Each of these specialized
DoFn implementations has associated methods
+on the PCollection, PTable, and PGroupedTable interfaces to support common data processing
steps.
+
+The simplest extension is the [FilterFn<T>](apidocs/current/org/apache/crunch/FilterFn.html)
class, which defines a single abstract method, `boolean accept(T input)`.
+The FilterFn can be applied to a `PCollection<T>` by calling the `filter(FilterFn<T>
fn)` method, and will return a new `PCollection<T>` that only contains
+the elements of the input PCollection for which the accept method returned true. Note that
the filter function does not include a PType argument in its
 signature, because there is no change in the data type of the PCollection when the FilterFn
is applied. It is possible to compose new FilterFn
-instances by combining multiple FilterFns together using the `and`, `or`, and `not` factory
methods defined in the FilterFns helper class.
+instances by combining multiple FilterFns together using the `and`, `or`, and `not` factory
methods defined in the
+[FilterFns](apidocs/current/org/apache/crunch/fn/FilterFns.html) helper class.
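A sketch of a FilterFn and its composition via the FilterFns factory methods (the `NonEmptyFn` class name is illustrative):

```java
import org.apache.crunch.FilterFn;

// Illustrative FilterFn that keeps only non-empty strings.
public class NonEmptyFn extends FilterFn<String> {
  @Override
  public boolean accept(String input) {
    return input != null && !input.isEmpty();
  }
}

// Applying and composing (assuming some other FilterFn<String> notCommentFn):
// PCollection<String> kept = lines.filter(
//     FilterFns.and(new NonEmptyFn(), notCommentFn));
```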
 
-The second extension is the `MapFn<S, T>` class, which defines a single abstract method,
`T map(S input)`. For simple transform tasks in which
-every input record will have exactly one output, it's easy to test a MapFn by verifying that
a given input returns a given output. MapFns are
-also used by Crunch's data serialization libraries to map between serialized data types (such
as Writables or Avro records) and POJOs.
+The second extension is the [MapFn<S, T>](apidocs/current/org/apache/crunch/MapFn.html)
class, which defines a single abstract method, `T map(S input)`.
+For simple transform tasks in which every input record will have exactly one output, it's
easy to test a MapFn by verifying that a given input returns a given output.
 
 MapFns are also used in specialized methods on the PCollection and PTable interfaces. `PCollection<V>`
defines the method
 `PTable<K,V> by(MapFn<V, K> mapFn, PType<K> keyType)` that can be used
to create a PTable from a PCollection by writing a
@@ -286,14 +293,17 @@ the key be given and constructs a `PTabl
 has methods `PTable<K1, V> mapKeys(MapFn<K, K1> mapFn)` and `PTable<K, V2>
mapValues(MapFn<V, V2>)` that handle the common case of converting
 just one of the paired values in a PTable instance from one type to another while leaving
the other type the same.
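A sketch of a MapFn used with `by()` (the `FirstCharFn` class is illustrative and assumes non-empty input lines; the Writable type family is assumed for the key PType):

```java
import org.apache.crunch.MapFn;
import org.apache.crunch.types.writable.Writables;

// Illustrative MapFn that keys each line by its first character.
public class FirstCharFn extends MapFn<String, String> {
  @Override
  public String map(String input) {
    return input.substring(0, 1);  // assumes input is non-empty
  }
}

// PTable<String, String> keyed = lines.by(new FirstCharFn(), Writables.strings());
// mapKeys/mapValues would then transform just one side of each pair.
```

Because a MapFn is a pure function of its input, it can be unit tested by calling `map` directly.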
 
-The final top-level extension to DoFn is the `CombineFn<K, V>` class, which is used
in conjunction with the `combineValues` method defined on the
-PGroupedTable interface. CombineFns are used to represent the associative operations that
can be applied using the MapReduce Combiner concept in
-order to reduce the amount of data that is shipped over the network during the shuffle. The
CombineFn extension is different from the FilterFn and
-MapFn classes in that it does not define an abstract method for handling data besides the
default `process` method that any other DoFn would use;
-rather, extending the CombineFn class signals to the Crunch planner that the logic contained
in this class satisfies the conditions required for use
-with the MapReduce combiner. Crunch supports many types of these associative patterns, such
as sums, counts, and set unions, via the `Aggregator<V>` interface,
-which is defined right alongside the CombineFn class in the top-level `org.apache.crunch`
package. There are a number of implementations of the Aggregator
-interface defined via static factory methods in the `org.apache.crunch.fn.Aggregators` class.
+The final top-level extension to DoFn is the [CombineFn<K, V>](apidocs/current/org/apache/crunch/CombineFn.html)
class, which is used in conjunction with
+the `combineValues` method defined on the PGroupedTable interface. CombineFns are used to
represent the associative operations that can be applied using
the MapReduce Combiner concept in order to reduce the amount of data that is shipped over the
network during a shuffle.
+
+The CombineFn extension is different from the FilterFn and MapFn classes in that it does
not define an abstract method for handling data
+beyond the default `process` method that any other DoFn would use; rather, extending the
CombineFn class signals to the Crunch planner that the logic
+contained in this class satisfies the conditions required for use with the MapReduce combiner.
+
+Crunch supports many types of these associative patterns, such as sums, counts, and set unions,
via the [Aggregator<V>](apidocs/current/org/apache/crunch/Aggregator.html)
+interface, which is defined right alongside the CombineFn class in the top-level `org.apache.crunch`
package. There are a number of implementations of the Aggregator
+interface defined via static factory methods in the [Aggregators](apidocs/current/org/apache/crunch/fn/Aggregators.html)
class.
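A sketch of the `combineValues` path using a stock Aggregator (the `CombineExample` class name is illustrative; this assumes a grouped table of word counts):

```java
import org.apache.crunch.PGroupedTable;
import org.apache.crunch.PTable;
import org.apache.crunch.fn.Aggregators;

// Illustrative: sum the grouped counts. Because SUM_LONGS is commutative and
// associative, Crunch can also run it as a map-side combiner before the shuffle.
public class CombineExample {
  public static PTable<String, Long> sumCounts(PGroupedTable<String, Long> grouped) {
    return grouped.combineValues(Aggregators.SUM_LONGS());
  }
}
```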
 
 ### Serializing Data with PTypes
 
@@ -301,16 +311,43 @@ Why PTypes Are Necessary, the two type f
 
 #### Extending PTypes
 
-### Reading Data: Sources
-
-### Writing Data: Targets
+The simplest way to create a new `PType<T>` for a data object is to create a _derived_
PType from one of the built-in PTypes for the Avro
+and Writable type families. If we have a base `PType<S>`, we can create a derived `PType<T>`
by implementing an input `MapFn<S, T>` and an
+output `MapFn<T, S>` and then calling `PTypeFamily.derived(Class<T>, MapFn<S,
T> in, MapFn<T, S> out, PType<S> base)`, which will return
+a new `PType<T>`. There are examples of derived PTypes in the [PTypes](apidocs/current/org/apache/crunch/types/PTypes.html)
class, including
+serialization support for protocol buffers, Thrift records, Java Enums, BigInteger, and UUIDs.
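A sketch of a derived PType, mapping between String and `java.net.URI` (the choice of URI and the class names are illustrative; this assumes the Writable family's `derived` helper, which mirrors `PTypeFamily.derived`):

```java
import java.net.URI;
import org.apache.crunch.MapFn;
import org.apache.crunch.types.PType;
import org.apache.crunch.types.writable.Writables;

// Illustrative derived PType: URIs are stored as their String form.
public class UriType {
  public static class StringToUri extends MapFn<String, URI> {
    @Override public URI map(String s) { return URI.create(s); }
  }

  public static class UriToString extends MapFn<URI, String> {
    @Override public String map(URI uri) { return uri.toString(); }
  }

  public static PType<URI> uris() {
    return Writables.derived(URI.class, new StringToUri(), new UriToString(),
        Writables.strings());
  }
}
```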
+
+### Reading and Writing Data: Sources, Targets, and SourceTargets
+
+MapReduce developers are familiar with the `InputFormat<K, V>` and `OutputFormat<K,
V>` classes for reading and writing data during
+MapReduce processing. Crunch has the analogous concepts of a `Source<T>` for reading
data and a `Target` for writing data. For data
+sources that may be treated as both the output of one pipeline phase and the input to another,
Crunch has a `SourceTarget<T>` interface
+that combines the functionality of both `Source<T>` and `Target`.
+
+Sources and Targets provide several useful extensions to the functionality provided by InputFormat
and OutputFormat. First, a Source can
+encapsulate an InputFormat as well as any special Configuration settings that are needed
by that InputFormat. For example, the
+`AvroInputFormat` needs to know the Avro schema of the input Avro file and expects to find
that schema associated with the "avro.schema" key
+in the `Configuration` object for a pipeline. But if you need to read multiple Avro files,
each with its own schema, during a single MapReduce
+job, you need a way of ensuring that the different schemas for each file do not all overwrite
the "avro.schema" key in the shared
+`Configuration` object. Crunch's `Source<T>` allows you to specify a set of key-value
entries that need to be set in the `Configuration`
+before a particular input is read in a way that prevents them from conflicting with each
other, while the Target interface provides the same
+functionality for OutputFormats.
+
+The `Source<T>` interface has two useful extensions. The first is `TableSource<K,
V>` which extends `Source<Pair<K, V>>` and can be
+used to read in a `PTable<K, V>` instance instead of a `PCollection<Pair<K, V>>`
instance. The second extension is `ReadableSource<T>`, which
+declares an `Iterable<T> read(Configuration conf)` method that allows the contents of
the Source to be read directly, either into the client
+or into a DoFn implementation that can use the data read from the source to perform additional
transforms on the main input data that is
processed using the DoFn's `process` method (this is how Crunch supports map-side join operations).
+
+Support for the most common Source, Target, and SourceTarget implementations is provided
by the factory functions declared in the
+[From](apidocs/current/org/apache/crunch/io/From.html) (Sources), [To](apidocs/current/org/apache/crunch/io/To.html)
(Targets), and
+[At](apidocs/current/org/apache/crunch/io/At.html) (SourceTargets) classes in the [org.apache.crunch.io](apidocs/current/org/apache/crunch/io/package-summary.html)
+package.
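A sketch of the From/To factories in use (the paths and the `IoExample` class name are placeholders):

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;
import org.apache.hadoop.conf.Configuration;

// Illustrative pipeline wiring: read a text file Source, write a text Target.
public class IoExample {
  public static void run() {
    Pipeline pipeline = new MRPipeline(IoExample.class, new Configuration());
    PCollection<String> lines = pipeline.read(From.textFile("/path/to/input"));
    pipeline.write(lines, To.textFile("/path/to/output"));
    pipeline.done();  // plans and executes the MapReduce jobs
  }
}
```

`At.textFile(...)` would return the combined SourceTarget form for data that is both written by one stage and read by another.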
 
 ### Pipeline Building and Execution
 
 #### Creating A New Crunch Pipeline
 
-Section here on Configuration of pipelines.
-
 #### Managing Pipeline Execution and Cleanup
 
 ## More Information


