crunch-commits mailing list archives

From build...@apache.org
Subject svn commit: r918395 - in /websites/staging/crunch/trunk/content: ./ user-guide.html
Date Mon, 04 Aug 2014 17:50:29 GMT
Author: buildbot
Date: Mon Aug  4 17:50:29 2014
New Revision: 918395

Log:
Staging update by buildbot for crunch

Modified:
    websites/staging/crunch/trunk/content/   (props changed)
    websites/staging/crunch/trunk/content/user-guide.html

Propchange: websites/staging/crunch/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon Aug  4 17:50:29 2014
@@ -1 +1 @@
-1602067
+1615711

Modified: websites/staging/crunch/trunk/content/user-guide.html
==============================================================================
--- websites/staging/crunch/trunk/content/user-guide.html (original)
+++ websites/staging/crunch/trunk/content/user-guide.html Mon Aug  4 17:50:29 2014
@@ -187,7 +187,7 @@
 </ol>
 </li>
 <li><a href="#sorting">Sorting</a><ol>
-<li><a href="#stdsort">Standard and Reveserse Sorting</a></li>
+<li><a href="#stdsort">Standard and Reverse Sorting</a></li>
 <li><a href="#secsort">Secondary Sorts</a></li>
 </ol>
 </li>
@@ -308,34 +308,34 @@ top of Apache Hadoop:</p>
 into more detail about their usage in the rest of the guide.</p>
 <p><a name="datamodel"></a></p>
 <h3 id="data-model-and-operators">Data Model and Operators</h3>
-<p>Crunch's Java API is centered around three interfaces that represent distributed datasets: <a href="apidocs/0.9.0/org/apache/crunch/PCollection.html">PCollection<T></a>,
-<a href="http://crunch.apache.org/apidocs/0.9.0/org/apache/crunch/PTable.html">PTable<K, V></a>, and <a href="apidocs/0.9.0/org/apache/crunch/PGroupedTable.html">PGroupedTable<K, V></a>.</p>
+<p>Crunch's Java API is centered around three interfaces that represent distributed datasets: <a href="apidocs/0.10.0/org/apache/crunch/PCollection.html">PCollection<T></a>,
+<a href="http://crunch.apache.org/apidocs/0.10.0/org/apache/crunch/PTable.html">PTable<K, V></a>, and <a href="apidocs/0.10.0/org/apache/crunch/PGroupedTable.html">PGroupedTable<K, V></a>.</p>
 <p>A <code>PCollection&lt;T&gt;</code> represents a distributed, immutable collection of elements of type T. For example, we represent a text file as a
-<code>PCollection&lt;String&gt;</code> object. <code>PCollection&lt;T&gt;</code> provides a method, <em>parallelDo</em>, that applies a <a href="apidocs/0.9.0/org/apache/crunch/DoFn.html">DoFn<T, U></a>
+<code>PCollection&lt;String&gt;</code> object. <code>PCollection&lt;T&gt;</code> provides a method, <em>parallelDo</em>, that applies a <a href="apidocs/0.10.0/org/apache/crunch/DoFn.html">DoFn<T, U></a>
 to each element in the <code>PCollection&lt;T&gt;</code> in parallel, and returns a new <code>PCollection&lt;U&gt;</code> as its result.</p>
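 <p>As a minimal sketch of the pattern (the collection names and logic here are illustrative, not from the guide), a DoFn that emits
 the length of each input line might look like this:</p>
 <pre>
   PCollection&lt;String&gt; lines = ...;
   PCollection&lt;Integer&gt; lengths = lines.parallelDo(new DoFn&lt;String, Integer&gt;() {
     @Override
     public void process(String line, Emitter&lt;Integer&gt; emitter) {
       emitter.emit(line.length());
     }
   }, Writables.ints());
 </pre>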
 <p>A <code>PTable&lt;K, V&gt;</code> is a sub-interface of <code>PCollection&lt;Pair&lt;K, V&gt;&gt;</code> that represents a distributed, unordered multimap of its key type K to its value type V.
 In addition to the parallelDo operation, PTable provides a <em>groupByKey</em> operation that aggregates all of the values in the PTable that
 have the same key into a single record. It is the groupByKey operation that triggers the sort phase of a MapReduce job. Developers can exercise
 fine-grained control over the number of reducers and the partitioning, grouping, and sorting strategies used during the shuffle by providing an instance
-of the <a href="apidocs/0.9.0/org/apache/crunch/GroupingOptions.html">GroupingOptions</a> class to the <code>groupByKey</code> function.</p>
+of the <a href="apidocs/0.10.0/org/apache/crunch/GroupingOptions.html">GroupingOptions</a> class to the <code>groupByKey</code> function.</p>
 <p>The result of a groupByKey operation is a <code>PGroupedTable&lt;K, V&gt;</code> object, which is a distributed, sorted map of keys of type K to an Iterable<V> that may
 be iterated over exactly once. In addition to <code>parallelDo</code> processing via DoFns, PGroupedTable provides a <em>combineValues</em> operation that allows a
-commutative and associative <a href="apidocs/0.9.0/org/apache/crunch/Aggregator.html">Aggregator<V></a> to be applied to the values of the PGroupedTable
+commutative and associative <a href="apidocs/0.10.0/org/apache/crunch/Aggregator.html">Aggregator<V></a> to be applied to the values of the PGroupedTable
 instance on both the map and reduce sides of the shuffle. A number of common <code>Aggregator&lt;V&gt;</code> implementations are provided in the
-<a href="apidocs/0.9.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a> class.</p>
+<a href="apidocs/0.10.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a> class.</p>
 <p>Finally, PCollection, PTable, and PGroupedTable all support a <em>union</em> operation, which takes a series of distinct PCollections that all have
 the same data type and treats them as a single virtual PCollection.</p>
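 <p>For example (with two hypothetical inputs), unioning a pair of string collections and processing them as one:</p>
 <pre>
   PCollection&lt;String&gt; aprilLogs = ...;
   PCollection&lt;String&gt; mayLogs = ...;
   // The result is a single virtual PCollection backed by both inputs.
   PCollection&lt;String&gt; allLogs = aprilLogs.union(mayLogs);
 </pre>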
 <p>All of the other data transformation operations supported by the Crunch APIs (aggregations, joins, sorts, secondary sorts, and cogrouping) are implemented
-in terms of these four primitives. The patterns themselves are defined in the <a href="apidocs/0.9.0/org/apache/crunch/lib/package-summary.html">org.apache.crunch.lib</a>
+in terms of these four primitives. The patterns themselves are defined in the <a href="apidocs/0.10.0/org/apache/crunch/lib/package-summary.html">org.apache.crunch.lib</a>
 package and its children, and a few of the most common patterns have convenience functions defined on the PCollection and PTable interfaces.</p>
-<p>Every Crunch data pipeline is coordinated by an instance of the <a href="apidocs/0.9.0/org/apache/crunch/Pipeline.html">Pipeline</a> interface, which defines
-methods for reading data into a pipeline via <a href="apidocs/0.9.0/org/apache/crunch/Source.html">Source<T></a> instances and writing data out from a
-pipeline to <a href="apidocs/0.9.0/org/apache/crunch/Target.html">Target</a> instances. There are currently three implementations of the Pipeline interface
+<p>Every Crunch data pipeline is coordinated by an instance of the <a href="apidocs/0.10.0/org/apache/crunch/Pipeline.html">Pipeline</a> interface, which defines
+methods for reading data into a pipeline via <a href="apidocs/0.10.0/org/apache/crunch/Source.html">Source<T></a> instances and writing data out from a
+pipeline to <a href="apidocs/0.10.0/org/apache/crunch/Target.html">Target</a> instances. There are currently three implementations of the Pipeline interface
 that are available for developers to use:</p>
 <ol>
-<li><a href="apidocs/0.9.0/org/apache/crunch/impl/mr/MRPipeline.html">MRPipeline</a>: Executes the pipeline as a series of MapReduce jobs.</li>
-<li><a href="apidocs/0.9.0/org/apache/crunch/impl/mem/MemPipeline.html">MemPipeline</a>: Executes the pipeline in-memory on the client.</li>
-<li><a href="apidocs/0.9.0/org/apache/crunch/impl/spark/SparkPipeline.html">SparkPipeline</a>: Executes the pipeline by converting it to a series of Spark pipelines.</li>
+<li><a href="apidocs/0.10.0/org/apache/crunch/impl/mr/MRPipeline.html">MRPipeline</a>: Executes the pipeline as a series of MapReduce jobs.</li>
+<li><a href="apidocs/0.10.0/org/apache/crunch/impl/mem/MemPipeline.html">MemPipeline</a>: Executes the pipeline in-memory on the client.</li>
+<li><a href="apidocs/0.10.0/org/apache/crunch/impl/spark/SparkPipeline.html">SparkPipeline</a>: Executes the pipeline by converting it to a series of Spark pipelines.</li>
 </ol>
 <p><a name="dataproc"></a></p>
 <h2 id="data-processing-with-dofns">Data Processing with DoFns</h2>
@@ -465,7 +465,7 @@ framework won't kill it,</li>
 </ul>
 <p>DoFns also have a number of helper methods for working with <a href="http://codingwiththomas.blogspot.com/2011/04/controlling-hadoop-job-recursion.html">Hadoop Counters</a>, all named <code>increment</code>. Counters are an incredibly useful way of keeping track of the state of long-running data pipelines and detecting any exceptional conditions that
 occur during processing, and they are supported in both the MapReduce-based and in-memory Crunch pipeline contexts. You can retrieve the value of the Counters
-in your client code at the end of a MapReduce pipeline by getting them from the <a href="apidocs/0.9.0/org/apache/crunch/PipelineResult.StageResult.html">StageResult</a>
+in your client code at the end of a MapReduce pipeline by getting them from the <a href="apidocs/0.10.0/org/apache/crunch/PipelineResult.StageResult.html">StageResult</a>
 objects returned by Crunch at the end of a run.</p>
 <ul>
 <li><code>increment(String groupName, String counterName)</code> increments the value of the given counter by 1.</li>
@@ -492,16 +492,16 @@ memory setting for the DoFn's needs befo
 <p><a name="mapfn"></a></p>
 <h3 id="common-dofn-patterns">Common DoFn Patterns</h3>
 <p>The Crunch APIs contain a number of useful subclasses of DoFn that handle common data processing scenarios and are easier
-to write and test. The top-level <a href="apidocs/0.9.0/org/apache/crunch/package-summary.html">org.apache.crunch</a> package contains three
+to write and test. The top-level <a href="apidocs/0.10.0/org/apache/crunch/package-summary.html">org.apache.crunch</a> package contains three
 of the most important specializations, which we will discuss now. Each of these specialized DoFn implementations has associated methods
 on the PCollection, PTable, and PGroupedTable interfaces to support common data processing steps.</p>
-<p>The simplest extension is the <a href="apidocs/0.9.0/org/apache/crunch/FilterFn.html">FilterFn<T></a> class, which defines a single abstract method, <code>boolean accept(T input)</code>.
+<p>The simplest extension is the <a href="apidocs/0.10.0/org/apache/crunch/FilterFn.html">FilterFn<T></a> class, which defines a single abstract method, <code>boolean accept(T input)</code>.
 The FilterFn can be applied to a <code>PCollection&lt;T&gt;</code> by calling the <code>filter(FilterFn&lt;T&gt; fn)</code> method, and will return a new <code>PCollection&lt;T&gt;</code> that only contains
 the elements of the input PCollection for which the accept method returned true. Note that the filter function does not include a PType argument in its
 signature, because there is no change in the data type of the PCollection when the FilterFn is applied. It is possible to compose new FilterFn
 instances by combining multiple FilterFns together using the <code>and</code>, <code>or</code>, and <code>not</code> factory methods defined in the
-<a href="apidocs/0.9.0/org/apache/crunch/fn/FilterFns.html">FilterFns</a> helper class.</p>
-<p>The second extension is the <a href="apidocs/0.9.0/org/apache/crunch/MapFn.html">MapFn<S, T></a> class, which defines a single abstract method, <code>T map(S input)</code>.
+<a href="apidocs/0.10.0/org/apache/crunch/fn/FilterFns.html">FilterFns</a> helper class.</p>
+<p>The second extension is the <a href="apidocs/0.10.0/org/apache/crunch/MapFn.html">MapFn<S, T></a> class, which defines a single abstract method, <code>T map(S input)</code>.
 For simple transform tasks in which every input record will have exactly one output, it's easy to test a MapFn by verifying that a given input returns a
 given output.</p>
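 <p>For instance (a hypothetical function, assuming JUnit's <code>assertEquals</code>), a MapFn can be verified directly without any
 pipeline machinery:</p>
 <pre>
   MapFn&lt;String, String&gt; toUpper = new MapFn&lt;String, String&gt;() {
     @Override
     public String map(String input) {
       return input.toUpperCase();
     }
   };
   // One input, exactly one output: easy to assert on.
   assertEquals("HELLO", toUpper.map("hello"));
 </pre>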
 <p>MapFns are also used in specialized methods on the PCollection and PTable interfaces. <code>PCollection&lt;V&gt;</code> defines the method
@@ -510,15 +510,15 @@ function that extracts the key (of type 
 the key be given and constructs a <code>PTableType&lt;K, V&gt;</code> from the given key type and the PCollection's existing value type. <code>PTable&lt;K, V&gt;</code>, in turn,
 has methods <code>PTable&lt;K1, V&gt; mapKeys(MapFn&lt;K, K1&gt; mapFn)</code> and <code>PTable&lt;K, V2&gt; mapValues(MapFn&lt;V, V2&gt;)</code> that handle the common case of converting
 just one of the paired values in a PTable instance from one type to another while leaving the other type the same.</p>
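 <p>A brief sketch of <code>mapValues</code> (the table is hypothetical; note that a PType for the new value type is supplied so Crunch
 knows how to serialize the result):</p>
 <pre>
   PTable&lt;String, Long&gt; counts = ...;
   // Convert the Long values to Strings; the String keys are left as-is.
   PTable&lt;String, String&gt; asText = counts.mapValues(new MapFn&lt;Long, String&gt;() {
     @Override
     public String map(Long value) {
       return value.toString();
     }
   }, Writables.strings());
 </pre>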
-<p>The final top-level extension to DoFn is the <a href="apidocs/0.9.0/org/apache/crunch/CombineFn.html">CombineFn<K, V></a> class, which is used in conjunction with
+<p>The final top-level extension to DoFn is the <a href="apidocs/0.10.0/org/apache/crunch/CombineFn.html">CombineFn<K, V></a> class, which is used in conjunction with
 the <code>combineValues</code> method defined on the PGroupedTable interface. CombineFns are used to represent the associative operations that can be applied using
 the MapReduce Combiner concept in order to reduce the amount of data that is shipped over the network during a shuffle.</p>
 <p>The CombineFn extension is different from the FilterFn and MapFn classes in that it does not define an abstract method for handling data
 beyond the default <code>process</code> method that any other DoFn would use; rather, extending the CombineFn class signals to the Crunch planner that the logic
 contained in this class satisfies the conditions required for use with the MapReduce combiner.</p>
-<p>Crunch supports many types of these associative patterns, such as sums, counts, and set unions, via the <a href="apidocs/0.9.0/org/apache/crunch/Aggregator.html">Aggregator<V></a>
+<p>Crunch supports many types of these associative patterns, such as sums, counts, and set unions, via the <a href="apidocs/0.10.0/org/apache/crunch/Aggregator.html">Aggregator<V></a>
 interface, which is defined right alongside the CombineFn class in the top-level <code>org.apache.crunch</code> package. There are a number of implementations of the Aggregator
-interface defined via static factory methods in the <a href="apidocs/0.9.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a> class. We will discuss
+interface defined via static factory methods in the <a href="apidocs/0.10.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a> class. We will discuss
 Aggregators more in the section on <a href="#aggregators">common MapReduce patterns</a>.</p>
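 <p>For example (an assumed word-count style table), summing the values for each key with a built-in Aggregator:</p>
 <pre>
   PTable&lt;String, Long&gt; counts = ...;
   PTable&lt;String, Long&gt; sums = counts.groupByKey().combineValues(Aggregators.SUM_LONGS());
 </pre>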
 <p><a name="serde"></a></p>
 <h2 id="serializing-data-with-ptypes">Serializing Data with PTypes</h2>
@@ -539,11 +539,11 @@ against an existing PCollection, <strong
   }
 </pre>
 
-<p>Crunch supports two different <em>type families</em>, which each implement the <a href="apidocs/0.9.0/org/apache/crunch/types/PTypeFamily.html">PTypeFamily</a> interface:
-one for Hadoop's <a href="apidocs/0.9.0/org/apache/crunch/types/writable/WritableTypeFamily.html">Writable interface</a> and another based on
-<a href="apidocs/0.9.0/org/apache/crunch/types/avro/AvroTypeFamily.html">Apache Avro</a>. There are also classes that contain static factory methods for
-each PTypeFamily to allow for easy import and usage: one for <a href="apidocs/0.9.0/org/apache/crunch/types/writable/Writables.html">Writables</a> and one for
-<a href="apidocs/0.9.0/org/apache/crunch/types/avro/Avros.html">Avros</a>.</p>
+<p>Crunch supports two different <em>type families</em>, which each implement the <a href="apidocs/0.10.0/org/apache/crunch/types/PTypeFamily.html">PTypeFamily</a> interface:
+one for Hadoop's <a href="apidocs/0.10.0/org/apache/crunch/types/writable/WritableTypeFamily.html">Writable interface</a> and another based on
+<a href="apidocs/0.10.0/org/apache/crunch/types/avro/AvroTypeFamily.html">Apache Avro</a>. There are also classes that contain static factory methods for
+each PTypeFamily to allow for easy import and usage: one for <a href="apidocs/0.10.0/org/apache/crunch/types/writable/Writables.html">Writables</a> and one for
+<a href="apidocs/0.10.0/org/apache/crunch/types/avro/Avros.html">Avros</a>.</p>
 <p>The two different type families exist for historical reasons: Writables have long been the standard form for representing serializable data in Hadoop,
 but the Avro-based serialization scheme is very compact and fast, and it allows complex record schemas to evolve over time. It's fine (and even encouraged)
 to mix-and-match PCollections that use different PTypes in the same Crunch pipeline (e.g., you could
@@ -580,7 +580,7 @@ can be used to kick off a shuffle on the
 </pre>
 
 <p>If you find yourself in a situation where you have a PCollection<Pair<K, V>&gt; and you need a PTable<K, V>, the
-<a href="apidocs/0.9.0/org/apache/crunch/lib/PTables.html">PTables</a> library class has methods that will do the conversion for you.</p>
+<a href="apidocs/0.10.0/org/apache/crunch/lib/PTables.html">PTables</a> library class has methods that will do the conversion for you.</p>
 <p>Let's look at some more example PTypes created using the common primitive and collection types. For most of your pipelines,
 you will use one type family exclusively, and so you can cut down on some of the boilerplate in your classes by importing
 all of the methods from the <code>Writables</code> or <code>Avros</code> classes into your class:</p>
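 <p>A sketch of that static-import style, assuming the Avro type family:</p>
 <pre>
   import static org.apache.crunch.types.avro.Avros.*;

   // With the static import in place, the factory calls read cleanly:
   PType&lt;Pair&lt;String, Long&gt;&gt; pairType = pairs(strings(), longs());
 </pre>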
@@ -648,7 +648,7 @@ includes both Avro generic and specific 
   PType<Record> avroGenericType = Avros.generics(schema);
 </pre>
 
-<p>The <a href="apidocs/0.9.0/org/apache/crunch/types/avro/Avros.html">Avros</a> class also has a <code>reflects</code> method for creating PTypes
+<p>The <a href="apidocs/0.10.0/org/apache/crunch/types/avro/Avros.html">Avros</a> class also has a <code>reflects</code> method for creating PTypes
 for POJOs using Avro's reflection-based serialization mechanism. There are a couple of restrictions on the structure of
 the POJO:</p>
 <ol>
@@ -685,7 +685,7 @@ to query intermediate results to aid in 
 <p>The simplest way to create a new <code>PType&lt;T&gt;</code> for a data object is to create a <em>derived</em> PType from one of the built-in PTypes from the Avro
 and Writable type families. If we have a base <code>PType&lt;S&gt;</code>, we can create a derived <code>PType&lt;T&gt;</code> by implementing an input <code>MapFn&lt;S, T&gt;</code> and an
 output <code>MapFn&lt;T, S&gt;</code> and then calling <code>PTypeFamily.derived(Class&lt;T&gt;, MapFn&lt;S, T&gt; in, MapFn&lt;T, S&gt; out, PType&lt;S&gt; base)</code>, which will return
-a new <code>PType&lt;T&gt;</code>. There are examples of derived PTypes in the <a href="apidocs/0.9.0/org/apache/crunch/types/PTypes.html">PTypes</a> class, including
+a new <code>PType&lt;T&gt;</code>. There are examples of derived PTypes in the <a href="apidocs/0.10.0/org/apache/crunch/types/PTypes.html">PTypes</a> class, including
 serialization support for protocol buffers, Thrift records, Java Enums, BigInteger, and UUIDs. The <a href="https://github.com/kevinweil/elephant-bird/tree/master/crunch">crunch module</a> of <a href="https://github.com/kevinweil/elephant-bird/">Twitter's ElephantBird</a> project also defines PTypes for working with
 protocol buffers and Thrift records that are serialized using ElephantBird's <code>BinaryWritable&lt;T&gt;</code>.</p>
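 <p>As a hedged sketch of the derived-PType recipe described above (the <code>java.net.URI</code> example is hypothetical), using the
 Avro type family:</p>
 <pre>
   PTypeFamily tf = AvroTypeFamily.getInstance();
   PType&lt;URI&gt; uriType = tf.derived(URI.class,
       new MapFn&lt;String, URI&gt;() {
         public URI map(String input) { return URI.create(input); }
       },
       new MapFn&lt;URI, String&gt;() {
         public String map(URI uri) { return uri.toString(); }
       },
       tf.strings());
 </pre>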
 <p>A common pattern in MapReduce programs is to define a Writable type that wraps a regular Java POJO. You can use derived PTypes to make it
@@ -744,7 +744,7 @@ You use a Source in conjunction with one
       Writables.tableOf(Writables.longs(), Writables.bytes())));
 </pre>
 
-<p>Note that Sources usually require a PType to be specified when they are created. The <a href="apidocs/0.9.0/org/apache/crunch/io/From.html">From</a>
+<p>Note that Sources usually require a PType to be specified when they are created. The <a href="apidocs/0.10.0/org/apache/crunch/io/From.html">From</a>
 class provides a number of factory methods for literate Source creation:</p>
 <pre>
   // Note that we are passing a String "/user/crunch/text", not a Path.
@@ -780,28 +780,28 @@ different files using the NLineInputForm
   </tr>
   <tr>
     <td>Text</td>
-    <td><a href="apidocs/0.9.0/org/apache/crunch/io/text/TextFileSource.html">org.apache.crunch.io.text.TextFileSource</a></td>
+    <td><a href="apidocs/0.10.0/org/apache/crunch/io/text/TextFileSource.html">org.apache.crunch.io.text.TextFileSource</a></td>
     <td>PCollection&lt;String&gt;</td>
     <td>textFile</td>
     <td>Works for both TextInputFormat and AvroUtf8InputFormat</td>
   </tr>
   <tr>
     <td>Sequence</td>
-    <td><a href="apidocs/0.9.0/org/apache/crunch/io/seq/SeqFileTableSource.html">org.apache.crunch.io.seq.SeqFileTableSource</a></td>
+    <td><a href="apidocs/0.10.0/org/apache/crunch/io/seq/SeqFileTableSource.html">org.apache.crunch.io.seq.SeqFileTableSource</a></td>
     <td>PTable&lt;K, V&gt;</td>
     <td>sequenceFile</td>
-    <td>Also has a <a href="apidocs/0.9.0/org/apache/crunch/io/seq/SeqFileSource.html">SeqFileSource</a> which reads the value and ignores the key.</td>
+    <td>Also has a <a href="apidocs/0.10.0/org/apache/crunch/io/seq/SeqFileSource.html">SeqFileSource</a> which reads the value and ignores the key.</td>
   </tr>
   <tr>
     <td>Avro</td>
-    <td><a href="apidocs/0.9.0/org/apache/crunch/io/avro/AvroFileSource.html">org.apache.crunch.io.avro.AvroFileSource</a></td>
+    <td><a href="apidocs/0.10.0/org/apache/crunch/io/avro/AvroFileSource.html">org.apache.crunch.io.avro.AvroFileSource</a></td>
     <td>PCollection&lt;V&gt;</td>
     <td>avroFile</td>
     <td>No PTable analogue for Avro records.</td>
   </tr>
   <tr>
     <td>Parquet</td>
-    <td><a href="apidocs/0.9.0/org/apache/crunch/io/parquet/AvroParquetFileSource.html">org.apache.crunch.io.parquet.AvroParquetFileSource</a></td>
+    <td><a href="apidocs/0.10.0/org/apache/crunch/io/parquet/AvroParquetFileSource.html">org.apache.crunch.io.parquet.AvroParquetFileSource</a></td>
     <td>PCollection&lt;V&gt;</td>
     <td>N/A</td>
     <td>Reads Avro records from a parquet-formatted file; expects an Avro PType.</td>
@@ -826,7 +826,7 @@ defined on the <code>Pipeline</code> int
 </pre>
 
 <p>Just as the Source interface has the <code>From</code> class of factory methods, Target factory methods are defined in a class named
-<a href="apidocs/0.9.0/org/apache/crunch/io/To.html">To</a> to enable literate programming:</p>
+<a href="apidocs/0.10.0/org/apache/crunch/io/To.html">To</a> to enable literate programming:</p>
 <pre>
   lines.write(To.textFile("/user/crunch/textout"));
 </pre>
@@ -856,25 +856,25 @@ parameters that this Target needs:</p>
   </tr>
   <tr>
     <td>Text</td>
-    <td><a href="apidocs/0.9.0/org/apache/crunch/io/text/TextFileTarget.html">org.apache.crunch.io.text.TextFileTarget</a></td>
+    <td><a href="apidocs/0.10.0/org/apache/crunch/io/text/TextFileTarget.html">org.apache.crunch.io.text.TextFileTarget</a></td>
     <td>textFile</td>
     <td>Will write out the string version of whatever it's given, which should be text. See also: Pipeline.writeTextFile.</td>
   </tr>
   <tr>
     <td>Sequence</td>
-    <td><a href="apidocs/0.9.0/org/apache/crunch/io/seq/SeqFileTarget.html">org.apache.crunch.io.seq.SeqFileTarget</a></td>
+    <td><a href="apidocs/0.10.0/org/apache/crunch/io/seq/SeqFileTarget.html">org.apache.crunch.io.seq.SeqFileTarget</a></td>
     <td>sequenceFile</td>
     <td>Works on both PCollection and PTable.</td>
   </tr>
   <tr>
     <td>Avro</td>
-    <td><a href="apidocs/0.9.0/org/apache/crunch/io/avro/AvroFileTarget.html">org.apache.crunch.io.avro.AvroFileTarget</a></td>
+    <td><a href="apidocs/0.10.0/org/apache/crunch/io/avro/AvroFileTarget.html">org.apache.crunch.io.avro.AvroFileTarget</a></td>
     <td>avroFile</td>
     <td>Treats PTables as PCollections of Pairs.</td>
   </tr>
   <tr>
     <td>Parquet</td>
-    <td><a href="apidocs/0.9.0/org/apache/crunch/io/parquet/AvroParquetFileTarget.html">org.apache.crunch.io.parquet.AvroParquetFileTarget</a></td>
+    <td><a href="apidocs/0.10.0/org/apache/crunch/io/parquet/AvroParquetFileTarget.html">org.apache.crunch.io.parquet.AvroParquetFileTarget</a></td>
     <td>N/A</td>
     <td>Writes Avro records to parquet-formatted files; expects an Avro PType.</td>
   </tr>
@@ -885,13 +885,13 @@ parameters that this Target needs:</p>
 <p>The <code>SourceTarget&lt;T&gt;</code> interface extends both the <code>Source&lt;T&gt;</code> and <code>Target</code> interfaces and allows a Path to act as both a
 Target for some PCollections as well as a Source for others. SourceTargets are convenient for any intermediate outputs within
 your pipeline. Just as we have the factory methods in the From and To classes for Sources and Targets, factory methods for
-SourceTargets are declared in the <a href="apidocs/0.9.0/org/apache/crunch/io/At.html">At</a> class.</p>
+SourceTargets are declared in the <a href="apidocs/0.10.0/org/apache/crunch/io/At.html">At</a> class.</p>
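 <p>For example (the path is illustrative), an intermediate output can be written as a Target and later read back as a Source:</p>
 <pre>
   PCollection&lt;String&gt; cleaned = ...;
   cleaned.write(At.textFile("/user/crunch/checkpoint"));
 </pre>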
 <p>In many pipeline applications, we want to control how any existing files in our target paths are handled by Crunch. For example,
 we might want the pipeline to fail quickly if an output path already exists, or we might want to delete the existing files
 and overwrite them with our new outputs. We might also want to use an output path as a <em>checkpoint</em> for our data pipeline.
 Checkpoints allow us to specify that a Path should be used as the starting location for our pipeline execution if the data
 it contains is newer than the data in the paths associated with any upstream inputs to that output location.</p>
-<p>Crunch supports these different output options via the <a href="apidocs/0.9.0/org/apache/crunch/Target.WriteMode.html">WriteMode</a> enum,
+<p>Crunch supports these different output options via the <a href="apidocs/0.10.0/org/apache/crunch/Target.WriteMode.html">WriteMode</a> enum,
 which can be passed along with a Target to the <code>write</code> method on either PCollection or Pipeline. Here are the supported
 WriteModes for Crunch:</p>
 <pre>
@@ -928,19 +928,19 @@ the Iterable returned by <code>Iterable&
 one of the <code>run</code> methods on the Pipeline interface that are used to manage overall pipeline execution. This means that you can instruct
 Crunch to materialize multiple PCollections and have them all created within a single Pipeline run.</p>
 <p>If you ask Crunch to materialize a PCollection that is returned from Pipeline's <code>PCollection&lt;T&gt; read(Source&lt;T&gt; source)</code> method, then no
-MapReduce job will be executed if the given Source implements the <a href="apidocs/0.9.0/org/apache/crunch/io/ReadableSource.html">ReadableSource</a>
+MapReduce job will be executed if the given Source implements the <a href="apidocs/0.10.0/org/apache/crunch/io/ReadableSource.html">ReadableSource</a>
 interface. If the Source is not readable, then a map-only job will be executed to map the data to a format that Crunch knows how to
 read from disk.</p>
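 <p>A minimal sketch of client-side iteration over a materialized PCollection (the collection itself is assumed):</p>
 <pre>
   PCollection&lt;String&gt; words = ...;
   // Iterating runs the pipeline first if the data has not been created yet.
   for (String word : words.materialize()) {
     System.out.println(word);
   }
 </pre>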
 <p>Sometimes, the output of a Crunch pipeline will be a single value, such as the number of elements in a PCollection. In other instances,
 you may want to perform some additional client-side computations on the materialized contents of a PCollection in a way that is
-transparent to users of your libraries. For these situations, Crunch defines a <a href="apidocs/0.9.0/org/apache/crunch/PObject.html">PObject<V></a>
+transparent to users of your libraries. For these situations, Crunch defines a <a href="apidocs/0.10.0/org/apache/crunch/PObject.html">PObject<V></a>
 interface that has an associated <code>V getValue()</code> method. PCollection's <code>PObject&lt;Long&gt; length()</code> method returns a reference to the number
 of elements contained in that PCollection, but the pipeline tasks required to compute this value will not run until the <code>Long getValue()</code>
 method of the returned PObject is called.</p>
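 <p>For instance (a hypothetical collection), counting elements lazily with a PObject:</p>
 <pre>
   PCollection&lt;String&gt; events = ...;
   PObject&lt;Long&gt; count = events.length();
   // No jobs have run yet; calling getValue() forces the computation.
   Long numEvents = count.getValue();
 </pre>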
 <p><a name="patterns"></a></p>
 <h2 id="data-processing-patterns-in-crunch">Data Processing Patterns in Crunch</h2>
 <p>This section describes the various data processing patterns implemented in Crunch's library APIs,
-which are in the <a href="apidocs/0.9.0/org/apache/crunch/lib/package-summary.html">org.apache.crunch.lib</a>
+which are in the <a href="apidocs/0.10.0/org/apache/crunch/lib/package-summary.html">org.apache.crunch.lib</a>
 package.</p>
 <p><a name="gbk"></a></p>
 <h3 id="groupbykey">groupByKey</h3>
@@ -955,7 +955,7 @@ explicitly provided by the developer bas
 <li><code>groupByKey(GroupingOptions options)</code>: Complex shuffle operations that require custom partitions
 and comparators.</li>
 </ol>
-<p>The <a href="apidocs/0.9.0/org/apache/crunch/GroupingOptions.html">GroupingOptions</a> class allows developers
+<p>The <a href="apidocs/0.10.0/org/apache/crunch/GroupingOptions.html">GroupingOptions</a> class allows developers
 to exercise precise control over how data is partitioned, sorted, and grouped by the underlying
 execution engine. Crunch was originally developed on top of MapReduce, and so the GroupingOptions APIs
 expect instances of Hadoop's <a href="http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Partitioner.html">Partitioner</a>
@@ -963,7 +963,7 @@ and <a href="http://hadoop.apache.org/do
 classes in order to support partitions and sorts. That said, Crunch has adapters in place so that these
 same classes may also be used with other execution engines, like Apache Spark, without a rewrite.</p>
 <p>The GroupingOptions class is immutable; to create a new one, take advantage of the
-<a href="apidocs/0.9.0/org/apache/crunch/GroupingOptions.Builder.html">GroupingOptions.Builder</a> implementation.</p>
+<a href="apidocs/0.10.0/org/apache/crunch/GroupingOptions.Builder.html">GroupingOptions.Builder</a> implementation.</p>
 <pre>
   GroupingOptions opts = GroupingOptions.builder()
       .groupingComparatorClass(MyGroupingComparator.class)
@@ -985,10 +985,10 @@ pipeline.</p>
 <p>Calling one of the groupByKey methods on PTable returns an instance of the PGroupedTable interface.
 PGroupedTable provides a <code>combineValues</code> method that can be used to signal to the planner that we want to perform
 associative aggregations on our data both before and after the shuffle.</p>
-<p>There are two ways to use combineValues: you can create an extension of the <a href="apidocs/0.9.0/org/apache/crunch/CombineFn.html">CombineFn</a>
-abstract base class, or you can use an instance of the <a href="apidocs/0.9.0/org/apache/crunch/Aggregator.html">Aggregator<V></a>
+<p>There are two ways to use combineValues: you can create an extension of the <a href="apidocs/0.10.0/org/apache/crunch/CombineFn.html">CombineFn</a>
+abstract base class, or you can use an instance of the <a href="apidocs/0.10.0/org/apache/crunch/Aggregator.html">Aggregator<V></a>
 interface. Of the two, an Aggregator is probably the way you want to go; Crunch provides a number of
-<a href="0.9.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a>, and they are a bit easier to write and compose together.
+<a href="0.10.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a>, and they are a bit easier to write and compose together.
 Let's walk through a few example aggregations:</p>
 <pre>
   PTable&lt;String, Double&gt; data = ...;
@@ -1029,7 +1029,7 @@ the average of a set of values:</p>
 <h3 id="simple-aggregations">Simple Aggregations</h3>
 <p>Many of the most common aggregation patterns in Crunch are provided as methods on the PCollection
 interface, including <code>count</code>, <code>max</code>, <code>min</code>, and <code>length</code>. The implementations of these methods,
-however, are in the <a href="apidocs/0.9.0/org/apache/crunch/lib/Aggregate.html">Aggregate</a> library class.
+however, are in the <a href="apidocs/0.10.0/org/apache/crunch/lib/Aggregate.html">Aggregate</a> library class.
 The methods in the Aggregate class expose some additional options that you can use for performing
 aggregations, such as controlling the level of parallelism for count operations:</p>
 <pre>
@@ -1050,9 +1050,9 @@ most frequently occurring elements, you w
 <p><a name="joins"></a></p>
 <h3 id="joining-data">Joining Data</h3>
 <p>Joins in Crunch are based on equal-valued keys in different PTables. Joins have also evolved
-a great deal in Crunch over the lifetime of the project. The <a href="apidocs/0.9.0/org/apache/crunch/lib/Join.html">Join</a>
+a great deal in Crunch over the lifetime of the project. The <a href="apidocs/0.10.0/org/apache/crunch/lib/Join.html">Join</a>
 API provides simple methods for performing equijoins, left joins, right joins, and full joins, but modern
-Crunch joins are usually performed using an explicit implementation of the <a href="apidocs/0.9.0/org/apache/crunch/lib/join/JoinStrategy.html">JoinStrategy</a>
+Crunch joins are usually performed using an explicit implementation of the <a href="apidocs/0.10.0/org/apache/crunch/lib/join/JoinStrategy.html">JoinStrategy</a>
 interface, which has support for the same rich set of joins that you can use in tools like Apache Hive and
 Apache Pig.</p>
 <p>All of the algorithms discussed below implement the JoinStrategy interface, which defines a single join method:</p>
@@ -1063,36 +1063,45 @@ Apache Pig.</p>
   PTable&lt;K, Pair&lt;V1, V2&gt;&gt; joined = strategy.join(one, two, JoinType);
 </pre>
 
-<p>The <a href="apidocs/0.9.0/org/apache/crunch/lib/join/JoinType.html">JoinType</a> enum determines which
+<p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/join/JoinType.html">JoinType</a> enum determines which
 kind of join is applied: inner, outer, left, right, or full. In general, the smaller of the two
-inputs should be the left-most argument to the join method. The only exception to this (for unfortunate
-historical reasons that the Crunch developers deeply apologize for) is for mapside-joins, where the
-left-most argument should be the <em>larger</em> input.</p>
+inputs should be the left-most argument to the join method.</p>
+<p>Note that the values of the PTables you join should be non-null. The join
+algorithms in Crunch use null as a placeholder to represent that there are no values for
+a given key in a PCollection, so joining PTables that contain null values may have
+surprising results. Using a non-null dummy value in your PCollections is a good idea in
+general.</p>
 <p><a name="reducejoin"></a></p>
 <h4 id="reduce-side-joins">Reduce-side Joins</h4>
-<p>Reduce-side joins are handled by the <a href="apidocs/0.9.0/org/apache/crunch/lib/join/DefaultJoinStrategy.html">DefaultJoinStrategy</a>.
+<p>Reduce-side joins are handled by the <a href="apidocs/0.10.0/org/apache/crunch/lib/join/DefaultJoinStrategy.html">DefaultJoinStrategy</a>.
 Reduce-side joins are the simplest and most robust kind of joins in Hadoop; the keys from the two inputs are
 shuffled together to the reducers, where the values from the smaller of the two collections are collected and then
 streamed over the values from the larger of the two collections. You can control the number of reducers that is used
 to perform the join by passing an integer argument to the DefaultJoinStrategy constructor.</p>
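 <p>A short sketch of a reduce-side inner join (the tables and the reducer count are hypothetical):</p>
 <pre>
   PTable&lt;String, Long&gt; smaller = ...;
   PTable&lt;String, String&gt; larger = ...;
   JoinStrategy&lt;String, Long, String&gt; strategy = new DefaultJoinStrategy&lt;String, Long, String&gt;(20);
   PTable&lt;String, Pair&lt;Long, String&gt;&gt; joined = strategy.join(smaller, larger, JoinType.INNER_JOIN);
 </pre>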
 <p><a name="mapjoin"></a></p>
 <h4 id="map-side-joins">Map-side Joins</h4>
-<p>Map-side joins are handled by the <a href="apidocs/0.9.0/org/apache/crunch/lib/join/MapsideJoinStrategy.html">MapsideJoinStrategy</a>.
+<p>Map-side joins are handled by the <a href="apidocs/0.10.0/org/apache/crunch/lib/join/MapsideJoinStrategy.html">MapsideJoinStrategy</a>.
 Map-side joins require that the smaller of the two input tables is loaded into memory on the tasks on the cluster, so
 there is a requirement that at least one of the tables be relatively small so that it can comfortably fit into memory within
-each task. <em>Remember, the MapsideJoinStrategy is the only JoinStrategy implementation where the left-most argument should
-be larger than the right-most one.</em></p>
+each task.</p>
+<p>For a long time, the MapsideJoinStrategy differed from the rest of the JoinStrategy
+implementations in that the left-most argument was intended to be larger than the right-side
+one, since the right-side PTable was loaded into memory. Since Crunch 0.10.0/0.8.3, we
+have deprecated the old MapsideJoinStrategy constructor which had the sizes reversed and
+recommend that you use the <code>MapsideJoinStrategy.create()</code> factory method, which returns an
+implementation of the MapsideJoinStrategy in which the left-side PTable is loaded into
+memory instead of the right-side PTable.</p>
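 <p>Following the recommendation above, a sketch using the <code>create()</code> factory method (the inputs are hypothetical; the
 left-side table must fit in memory):</p>
 <pre>
   PTable&lt;String, Long&gt; small = ...;   // loaded into memory on each task
   PTable&lt;String, String&gt; large = ...;
   JoinStrategy&lt;String, Long, String&gt; strategy = MapsideJoinStrategy.create();
   PTable&lt;String, Pair&lt;Long, String&gt;&gt; joined = strategy.join(small, large, JoinType.INNER_JOIN);
 </pre>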
 <p><a name="shardedjoin"></a></p>
 <h4 id="sharded-joins">Sharded Joins</h4>
 <p>Many distributed joins have skewed data that can cause regular reduce-side joins to fail due to out-of-memory issues on
 the partitions that happen to contain the keys with highest cardinality. To handle these skew issues, Crunch has the
-<a href="apidocs/0.9.0/org/apache/crunch/lib/join/ShardedJoinStrategy.html">ShardedJoinStrategy</a> that allows developers to shard
+<a href="apidocs/0.10.0/org/apache/crunch/lib/join/ShardedJoinStrategy.html">ShardedJoinStrategy</a> that allows developers to shard
 each key to multiple reducers, which prevents a few reducers from getting overloaded with the values from the skewed keys
 in exchange for sending more data over the wire. For problems with significant skew issues, the ShardedJoinStrategy can
 significantly improve performance.</p>
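 <p>A hedged sketch, assuming the ShardedJoinStrategy constructor that takes the number of shards per key:</p>
 <pre>
   PTable&lt;String, Long&gt; skewed = ...;
   PTable&lt;String, String&gt; other = ...;
   // Each key is spread across (up to) four reducers to dilute hot keys.
   JoinStrategy&lt;String, Long, String&gt; strategy = new ShardedJoinStrategy&lt;String, Long, String&gt;(4);
   PTable&lt;String, Pair&lt;Long, String&gt;&gt; joined = strategy.join(skewed, other, JoinType.INNER_JOIN);
 </pre>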
 <p><a name="bloomjoin"></a></p>
 <h4 id="bloom-filter-joins">Bloom Filter Joins</h4>
-<p>Last but not least, the <a href="apidocs/0.9.0/org/apache/crunch/lib/join/BloomFilterJoinStrategy.html">BloomFilterJoinStrategy</a> builds
+<p>Last but not least, the <a href="apidocs/0.10.0/org/apache/crunch/lib/join/BloomFilterJoinStrategy.html">BloomFilterJoinStrategy</a> builds
 a <a href="http://en.wikipedia.org/wiki/Bloom_filter">bloom filter</a> on the left-hand side table that is used to filter the contents
 of the right-hand side table to eliminate entries from the (larger) right-hand side table that have no hope of being joined
 to values in the left-hand side table. This is useful in situations in which the left-hand side table is too large to fit
@@ -1104,7 +1113,7 @@ vast majority of the keys in the right-h
 For example, we might want to join two datasets
 together and only emit a record if each of the sets had at least two distinct values associated
 with each key. For arbitrarily complex join logic, we can always fall back to the
-<a href="apidocs/0.9.0/org/apache/crunch/lib/Cogroup.html">Cogroup</a> API, which takes in an arbitrary number
+<a href="apidocs/0.10.0/org/apache/crunch/lib/Cogroup.html">Cogroup</a> API, which takes in an arbitrary number
 of PTable instances that all have the same key type and combines them together into a single
 PTable whose values are made up of Collections of the values from each of the input PTables.</p>
 <pre>
@@ -1130,7 +1139,7 @@ Crunch APIs have a number of utilities f
 more advanced patterns like secondary sorts.</p>
 <p><a name="stdsort"></a></p>
 <h4 id="standard-and-reverse-sorting">Standard and Reverse Sorting</h4>
-<p>The <a href="apidocs/0.9.0/org/apache/crunch/lib/Sort.html">Sort</a> API methods contain utility functions
+<p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/Sort.html">Sort</a> API methods contain utility functions
 for sorting the contents of PCollections and PTables whose contents implement the <code>Comparable</code>
 interface. By default, MapReduce does not perform total sorts on its keys during a shuffle; instead
 a sort is done locally on each of the partitions of the data that are sent to each reducer. Doing
@@ -1151,7 +1160,7 @@ total order partitioner and sorting cont
 
 <p>For more complex PCollections or PTables that are made up of Tuples (Pairs, Tuple3, etc.), we can
 specify which columns of the Tuple should be used for sorting the contents, and in which order, using
-the <a href="apidocs/0.9.0/org/apache/crunch/lib/Sort.ColumnOrder.html">ColumnOrder</a> class:</p>
+the <a href="apidocs/0.10.0/org/apache/crunch/lib/Sort.ColumnOrder.html">ColumnOrder</a> class:</p>
 <pre>
   PTable&lt;String, Long&gt; table = ...;
   // Sorted by value, instead of key -- remember, a PTable is a PCollection of Pairs.
@@ -1162,7 +1171,7 @@ the <a href="apidocs/0.9.0/org/apache/cr
 <h4 id="secondary-sorts">Secondary Sorts</h4>
 <p>Another pattern that occurs frequently in distributed processing is <em>secondary sorts</em>, where we
 want to group a set of records by one key and sort the records within each group by a second key.
-The <a href="apidocs/0.9.0/org/apache/crunch/lib/SecondarySort.html">SecondarySort</a> API provides a set
+The <a href="apidocs/0.10.0/org/apache/crunch/lib/SecondarySort.html">SecondarySort</a> API provides a set
 of <code>sortAndApply</code> methods that can be used on input PTables of the form <code>PTable&lt;K, Pair&lt;K2, V&gt;&gt;</code>,
 where <code>K</code> is the primary grouping key and <code>K2</code> is the secondary grouping key. The <code>sortAndApply</code>
 method will perform the grouping and sorting and will then apply a given DoFn to process the
@@ -1177,7 +1186,7 @@ techniques throughout its library APIs.<
 one of the datasets to be small enough to fit into memory, and then do a pass over the larger data
 set where we emit an element of the smaller data set along with each element from the larger set.</p>
 <p>When this pattern isn't possible but we still need to take the cartesian product, we have some options,
-but they're fairly expensive. Crunch's <a href="apidocs/0.9.0/org/apache/crunch/lib/Cartesian.html">Cartesian</a> API
+but they're fairly expensive. Crunch's <a href="apidocs/0.10.0/org/apache/crunch/lib/Cartesian.html">Cartesian</a> API
 provides methods for a reduce-side full cross product between two PCollections (or PTables). Note that
 this is a pretty expensive operation, and you should go out of your way to avoid these kinds of processing
 steps in your pipelines.</p>
@@ -1185,7 +1194,7 @@ steps in your pipelines.</p>
 <h4 id="coalescing">Coalescing</h4>
 <p>Many MapReduce jobs have the potential to generate a large number of small files that could be used more
 effectively by clients if they were all merged together into a small number of large files. The
-<a href="apidocs/0.9.0/org/apache/crunch/lib/Shard.html">Shard</a> API provides a single method, <code>shard</code>, that allows
+<a href="apidocs/0.10.0/org/apache/crunch/lib/Shard.html">Shard</a> API provides a single method, <code>shard</code>, that allows
 you to coalesce a given PCollection into a fixed number of partitions:</p>
 <pre>
   PCollection&lt;Long&gt; data = ...;
@@ -1196,7 +1205,7 @@ you to coalesce a given PCollection into
 partitions. This is often a useful step at the end of a long pipeline run.</p>
 <p><a name="distinct"></a></p>
 <h4 id="distinct">Distinct</h4>
-<p>Crunch's <a href="apidocs/0.9.0/org/apache/crunch/lib/Distinct.html">Distinct</a> API has a method, <code>distinct</code>, that
+<p>Crunch's <a href="apidocs/0.10.0/org/apache/crunch/lib/Distinct.html">Distinct</a> API has a method, <code>distinct</code>, that
 returns one copy of each unique element in a given PCollection:</p>
 <pre>
   PCollection&lt;Long&gt; data = ...;
@@ -1218,7 +1227,7 @@ value for your own pipelines. The optima
 thus the amount of memory they consume) and the number of unique elements in the data.</p>
 <p><a name="sampling"></a></p>
 <h4 id="sampling">Sampling</h4>
-<p>The <a href="apidocs/0.9.0/org/apache/crunch/lib/Sample.html">Sample</a> API provides methods for two sorts of PCollection
+<p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/Sample.html">Sample</a> API provides methods for two sorts of PCollection
 sampling: random and reservoir.</p>
 <p>Random sampling includes each record in the sample with a fixed probability, and is probably what you're
 used to when you think of sampling from a collection:</p>
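 <pre>
   PCollection&lt;String&gt; data = ...;
   // Each element is kept independently with probability 0.05 (a 5% sample; the rate is arbitrary).
   PCollection&lt;String&gt; sampled = Sample.sample(data, 0.05);
 </pre>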
@@ -1244,13 +1253,13 @@ random number generators. Note that all 
 only require a single pass over the data.</p>
 <p><a name="sets"></a></p>
 <h4 id="set-operations">Set Operations</h4>
-<p>The <a href="apidocs/0.9.0/org/apache/crunch/lib/Set.html">Set</a> API methods complement Crunch's built-in <code>union</code> methods and
+<p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/Set.html">Set</a> API methods complement Crunch's built-in <code>union</code> methods and
 provide support for finding the intersection, the difference, or the <a href="http://en.wikipedia.org/wiki/Comm">comm</a> of two PCollections.</p>
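 <p>For example (with two hypothetical string collections):</p>
 <pre>
   PCollection&lt;String&gt; one = ...;
   PCollection&lt;String&gt; two = ...;
   PCollection&lt;String&gt; inBoth = Set.intersection(one, two);
   PCollection&lt;String&gt; onlyInOne = Set.difference(one, two);
 </pre>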
 <p><a name="splits"></a></p>
 <h4 id="splits">Splits</h4>
 <p>Sometimes, you want to write two different outputs from the same DoFn into different PCollections. An example of this would
 be a pipeline in which you wanted to write good records to one file and bad or corrupted records to a different file for
-further examination. The <a href="apidocs/0.9.0/org/apache/crunch/lib/Channels.html">Channels</a> class provides a method that allows
+further examination. The <a href="apidocs/0.10.0/org/apache/crunch/lib/Channels.html">Channels</a> class provides a method that allows
 you to split an input PCollection of Pairs into a Pair of PCollections:</p>
 <pre>
   PCollection&lt;Pair&lt;L, R&gt;&gt; in = ...;
@@ -1320,31 +1329,31 @@ the maximum value encountered would be i
 flexible schemas for PCollections and PTables, you can write pipelines that operate directly on HBase API classes like
 <code>Put</code>, <code>KeyValue</code>, and <code>Result</code>.</p>
 <p>Be sure that the version of Crunch that you're using is compatible with the version of HBase that you are running. The 0.8.x
-Crunch versions and earlier ones are developed against HBase 0.94.x, while version 0.9.0 and after are developed against
+Crunch versions and earlier ones are developed against HBase 0.94.x, while version 0.10.0 and after are developed against
 HBase 0.96. There were a small number of backwards-incompatible changes made between HBase 0.94 and 0.96 that are reflected
 in the Crunch APIs for working with HBase. The most important of these is that in HBase 0.96, HBase's <code>Put</code>, <code>KeyValue</code>, and <code>Result</code>
-classes no longer implement the Writable interface. To support working with these types in Crunch 0.9.0, we added the
-<a href="apidocs/0.9.0/org/apache/crunch/io/hbase/HBaseTypes.html">HBaseTypes</a> class that has factory methods for creating PTypes that serialize the HBase client classes to bytes so
+classes no longer implement the Writable interface. To support working with these types in Crunch 0.10.0, we added the
+<a href="apidocs/0.10.0/org/apache/crunch/io/hbase/HBaseTypes.html">HBaseTypes</a> class that has factory methods for creating PTypes that serialize the HBase client classes to bytes so
 that they can still be used as part of MapReduce pipelines.</p>
-<p>Crunch supports working with HBase data in two ways. The <a href="apidocs/0.9.0/org/apache/crunch/io/hbase/HBaseSourceTarget.html">HBaseSourceTarget</a> and <a href="apidocs/0.9.0/org/apache/crunch/io/hbase/HBaseTarget.html">HBaseTarget</a> classes support reading and
-writing data to HBase tables directly. The <a href="apidocs/0.9.0/org/apache/crunch/io/hbase/HFileSource.html">HFileSource</a> and <a href="apidocs/0.9.0/org/apache/crunch/io/hbase/HFileTarget.html">HFileTarget</a> classes support reading and writing data
+<p>Crunch supports working with HBase data in two ways. The <a href="apidocs/0.10.0/org/apache/crunch/io/hbase/HBaseSourceTarget.html">HBaseSourceTarget</a> and <a href="apidocs/0.10.0/org/apache/crunch/io/hbase/HBaseTarget.html">HBaseTarget</a> classes support reading and
+writing data to HBase tables directly. The <a href="apidocs/0.10.0/org/apache/crunch/io/hbase/HFileSource.html">HFileSource</a> and <a href="apidocs/0.10.0/org/apache/crunch/io/hbase/HFileTarget.html">HFileTarget</a> classes support reading and writing data
 to hfiles, which are the underlying file format for HBase. HFileSource and HFileTarget can be used to read and write data to
 hfiles directly, which is much faster than going through the HBase APIs and can be used to perform efficient bulk loading of data
-into HBase tables. See the utility methods in the <a href="apidocs/0.9.0/org/apache/crunch/io/hbase/HFileUtils.html">HFileUtils</a> class for
+into HBase tables. See the utility methods in the <a href="apidocs/0.10.0/org/apache/crunch/io/hbase/HFileUtils.html">HFileUtils</a> class for
 more details on how to work with PCollections against hfiles.</p>
 <p><a name="exec"></a></p>
 <h2 id="managing-pipeline-execution">Managing Pipeline Execution</h2>
 <p>Crunch uses a lazy execution model. No jobs are run or outputs created until the user explicitly invokes one of the methods on the
 Pipeline interface that controls job planning and execution. The simplest of these methods is the <code>PipelineResult run()</code> method,
 which analyzes the current graph of PCollections and Target outputs and comes up with a plan to ensure that each of the outputs is
-created and then executes it, returning only when the jobs are completed. The <a href="apidocs/0.9.0/org/apache/crunch/PipelineResult.html">PipelineResult</a>
+created and then executes it, returning only when the jobs are completed. The <a href="apidocs/0.10.0/org/apache/crunch/PipelineResult.html">PipelineResult</a>
 returned by the <code>run</code> method contains information about what was run, including the number of jobs that were executed during the
-pipeline run and the values of the Hadoop Counters for each of those stages via the <a href="apidocs/0.9.0/org/apache/crunch/PipelineResult.StageResult.html">StageResult</a> component classes.</p>
+pipeline run and the values of the Hadoop Counters for each of those stages via the <a href="apidocs/0.10.0/org/apache/crunch/PipelineResult.StageResult.html">StageResult</a> component classes.</p>
 <p>The last method that should be called in <em>any</em> Crunch pipeline run is the Pipeline interface's <code>PipelineResult done()</code> method. The done method will
 ensure that any remaining outputs that have not yet been created are generated via a call to <code>run</code>, and it will clean up the temporary directories that
 Crunch creates during runs to hold serialized job information and intermediate outputs.</p>
 <p>Crunch also allows developers to exercise finer-grained control over pipeline execution via Pipeline's <code>PipelineExecution runAsync()</code> method.
-The <code>runAsync</code> method is a non-blocking version of the <code>run</code> method that returns a <a href="apidocs/0.9.0/org/apache/crunch/PipelineExecution.html">PipelineExecution</a> instance that can be used to monitor the currently running Crunch pipeline. The PipelineExecution object is also useful for debugging
+The <code>runAsync</code> method is a non-blocking version of the <code>run</code> method that returns a <a href="apidocs/0.10.0/org/apache/crunch/PipelineExecution.html">PipelineExecution</a> instance that can be used to monitor the currently running Crunch pipeline. The PipelineExecution object is also useful for debugging
 Crunch pipelines by visualizing the Crunch execution plan in DOT format via its <code>String getPlanDotFile()</code> method. PipelineExecution implements
 Guava's <a href="https://code.google.com/p/guava-libraries/wiki/ListenableFutureExplained">ListenableFuture</a>, so you can attach handlers that will be
 called when your pipeline finishes executing.</p>
@@ -1360,7 +1369,7 @@ execution pipelines in a way that is exp
 the different execution engines.</p>
 <p><a name="mrpipeline"></a></p>
 <h3 id="mrpipeline">MRPipeline</h3>
-<p>The <a href="apidocs/0.9.0/org/apache/crunch/impl/mr/MRPipeline.html">MRPipeline</a> is the oldest implementation of the Pipeline interface and
+<p>The <a href="apidocs/0.10.0/org/apache/crunch/impl/mr/MRPipeline.html">MRPipeline</a> is the oldest implementation of the Pipeline interface and
 compiles and executes the DAG of PCollections into a series of MapReduce jobs. MRPipeline has three constructors that are commonly
 used:</p>
 <ol>
@@ -1420,7 +1429,7 @@ aware of:</p>
 
 <p><a name="sparkpipeline"></a></p>
 <h3 id="sparkpipeline">SparkPipeline</h3>
-<p>The <code>SparkPipeline</code> is the newest implementation of the Pipeline interface, and was added in Crunch 0.9.0. It has two default constructors:</p>
+<p>The <code>SparkPipeline</code> is the newest implementation of the Pipeline interface, and was added in Crunch 0.10.0. It has two default constructors:</p>
 <ol>
 <li><code>SparkPipeline(String sparkConnection, String appName)</code> which takes a Spark connection string, which is of the form <code>local[numThreads]</code> for
 local mode or <code>master:port</code> for a Spark cluster. This constructor will create its own <code>JavaSparkContext</code> instance to control the Spark pipeline
@@ -1446,7 +1455,7 @@ be a little rough around the edges and m
 actively working to ensure complete compatibility between the two implementations.</p>
 <p><a name="mempipeline"></a></p>
 <h3 id="mempipeline">MemPipeline</h3>
-<p>The <a href="apidocs/0.9.0/org/apache/crunch/impl/mem/MemPipeline.html">MemPipeline</a> implementation of Pipeline has a few interesting
+<p>The <a href="apidocs/0.10.0/org/apache/crunch/impl/mem/MemPipeline.html">MemPipeline</a> implementation of Pipeline has a few interesting
 properties. First, unlike MRPipeline, MemPipeline is a singleton; you don't create a MemPipeline, you just get a reference to it
 via the static <code>MemPipeline.getInstance()</code> method. Second, all of the operations in the MemPipeline are executed completely in-memory;
 there is no serialization of data to disk by default, and PType usage is fairly minimal. This has both benefits and drawbacks; on
@@ -1483,9 +1492,9 @@ without writing them out to disk.</p>
 interface has several tools to help developers create effective unit tests, which will be detailed in this section.</p>
 <h3 id="unit-testing-dofns">Unit Testing DoFns</h3>
 <p>Many of the DoFn implementations, such as <code>MapFn</code> and <code>FilterFn</code>, are very easy to test, since they accept a single input
-and return a single output. For general purpose DoFns, we need an instance of the <a href="apidocs/0.9.0/org/apache/crunch/Emitter.html">Emitter</a>
+and return a single output. For general purpose DoFns, we need an instance of the <a href="apidocs/0.10.0/org/apache/crunch/Emitter.html">Emitter</a>
 interface that we can pass to the DoFn's <code>process</code> method and then read in the values that are written by the function. Support
-for this pattern is provided by the <a href="apidocs/0.9.0/org/apache/crunch/impl/mem/emit/InMemoryEmitter.html">InMemoryEmitter</a> class, which
+for this pattern is provided by the <a href="apidocs/0.10.0/org/apache/crunch/impl/mem/emit/InMemoryEmitter.html">InMemoryEmitter</a> class, which
 has a <code>List&lt;T&gt; getOutput()</code> method that can be used to read the values that were passed to the Emitter instance by a DoFn instance:</p>
 <div class="codehilite"><pre><span class="p">@</span><span class="n">Test</span>
 <span class="n">public</span> <span class="n">void</span> <span class="n">testToUpperCaseFn</span><span class="p">()</span> <span class="p">{</span>


