crunch-commits mailing list archives

Subject svn commit: r972840 - in /websites/staging/crunch/trunk/content: ./ about.html bylaws.html download.html future-work.html getting-started.html index.html mailing-lists.html pipelines.html scrunch.html source-repository.html user-guide.html
Date Wed, 18 Nov 2015 11:49:54 GMT
Author: buildbot
Date: Wed Nov 18 11:49:54 2015
New Revision: 972840

Staging update by buildbot for crunch

    websites/staging/crunch/trunk/content/   (props changed)

Propchange: websites/staging/crunch/trunk/content/
--- cms:source-revision (original)
+++ cms:source-revision Wed Nov 18 11:49:54 2015
@@ -1 +1 @@

Modified: websites/staging/crunch/trunk/content/about.html
--- websites/staging/crunch/trunk/content/about.html (original)
+++ websites/staging/crunch/trunk/content/about.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
-                    <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+                    <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
-          <p>The initial source code of the Apache Crunch project has been written mostly
+          <style type="text/css">
+/* The following code is added by
+   It was originally lifted from */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>The initial source code of the Apache Crunch project has been written mostly
 by Josh Wills at <a href="">Cloudera</a> in 2011, based on
 Google's FlumeJava library. The project was open sourced at GitHub soon
 afterwards where serveral releases up to and including 0.2.4 were made.</p>
@@ -154,7 +165,7 @@ entered the <a href="http://incubator.ap
 the Incubator and three releases (0.3.0-incubating to 0.5.0-incubating), the
 Apache Board of Directors established the Apache Crunch project in February
 2013 as a new top level project.</p>
-<h2 id="team">Team</h2>
+<h2 id="team">Team<a class="headerlink" href="#team" title="Permanent link">&para;</a></h2>
   Markdown-generated tables don't have the proper CSS classes,
   so we use plain HTML tables.

Modified: websites/staging/crunch/trunk/content/bylaws.html
--- websites/staging/crunch/trunk/content/bylaws.html (original)
+++ websites/staging/crunch/trunk/content/bylaws.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
-                    <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+                    <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
-          <p>This document defines the bylaws under which the Apache Crunch 
+          <style type="text/css">
+/* The following code is added by
+   It was originally lifted from */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>This document defines the bylaws under which the Apache Crunch 
 project operates. It defines the roles and responsibilities of the 
 project, who may vote, how voting works, how conflicts are resolved, etc. </p>
 <p>Crunch is a project of the
@@ -159,11 +170,11 @@ of principles, known collectively as the
 Apache development, please refer to the
 <a href="">Incubator project</a> for more information
 on how Apache projects operate. </p>
-<h2 id="roles-and-responsibilities">Roles and Responsibilities</h2>
+<h2 id="roles-and-responsibilities">Roles and Responsibilities<a class="headerlink" href="#roles-and-responsibilities" title="Permanent link">&para;</a></h2>
 <p>Apache projects define a set of roles with associated rights and 
 responsibilities. These roles govern what tasks an individual may 
 perform within the project. The roles are defined in the following sections. </p>
-<h3 id="users">Users</h3>
+<h3 id="users">Users<a class="headerlink" href="#users" title="Permanent link">&para;</a></h3>
 <p>The most important participants in the project are people who use our 
 software. The majority of our contributors start out as users and guide 
 their development efforts from the user's perspective. </p>
@@ -171,13 +182,13 @@ their development efforts from the user'
 contributors in the form of bug reports and feature suggestions. As 
 well, users participate in the Apache community by helping other users 
 on mailing lists and user support forums. </p>
-<h3 id="contributors">Contributors</h3>
+<h3 id="contributors">Contributors<a class="headerlink" href="#contributors" title="Permanent link">&para;</a></h3>
 <p>All of the volunteers who are contributing time, code, documentation, or 
 resources to the Crunch project. A contributor that makes sustained, 
 welcome contributions to the project may be invited to become a 
 committer, though the exact timing of such invitations depends on many 
 factors. </p>
-<h3 id="committers">Committers</h3>
+<h3 id="committers">Committers<a class="headerlink" href="#committers" title="Permanent link">&para;</a></h3>
 <p>The project's committers are responsible for the project's technical 
 management. They have access to all of the project's code repositories
 and may cast binding votes on any technical discussion regarding the
@@ -200,7 +211,7 @@ more details on the requirements for com
 invited to become a member of the PMC. The form of contribution is not 
 limited to code. It can also include code review, helping out users on 
 the mailing lists, documentation, etc. </p>
-<h3 id="project-management-committee">Project Management Committee</h3>
+<h3 id="project-management-committee">Project Management Committee<a class="headerlink" href="#project-management-committee" title="Permanent link">&para;</a></h3>
 <p>The Project Management Committee (PMC) is responsible to the board and
 the ASF for the management and oversight of the Apache Crunch codebase.
 The responsibilities of the PMC include: </p>
@@ -232,13 +243,13 @@ Crunch project. </p>
 the chair resigns before the end of his or her term, the PMC votes to
 recommend a new chair using lazy consensus, but the decision must be ratified
 by the Apache board. </p>
-<h2 id="decision-making">Decision Making</h2>
+<h2 id="decision-making">Decision Making<a class="headerlink" href="#decision-making" title="Permanent link">&para;</a></h2>
 <p>Within the Apache Crunch project, different types of decisions require 
 different forms of approval. For example, the previous section describes 
 several decisions which require "lazy consensus" approval. This section 
 defines how voting is performed, the types of approvals, and which types 
 of decision require which type of approval. </p>
-<h3 id="voting">Voting</h3>
+<h3 id="voting">Voting<a class="headerlink" href="#voting" title="Permanent link">&para;</a></h3>
 <p>Decisions regarding the project are made by votes on the primary project 
 development mailing list Where necessary, PMC 
 voting may take place on the private Crunch PMC mailing list 
@@ -294,7 +305,7 @@ codebase. These typically take the form
 commit message sent when the commit is made. Note that this should be a 
 rare occurrence. All efforts should be made to discuss issues when they 
 are still patches before the code is committed. </p>
-<h3 id="approvals">Approvals</h3>
+<h3 id="approvals">Approvals<a class="headerlink" href="#approvals" title="Permanent link">&para;</a></h3>
 <p>These are the types of approvals that can be sought. Different actions 
 require different types of approvals. </p>
 <table class="table">
@@ -339,7 +350,7 @@ require different types of approvals. </
-<h3 id="vetoes">Vetoes</h3>
+<h3 id="vetoes">Vetoes<a class="headerlink" href="#vetoes" title="Permanent link">&para;</a></h3>
 <p>A valid, binding veto cannot be overruled. If a veto is cast, it must
 be accompanied by a valid reason explaining the reasons for the
 veto. The validity of a veto, if challenged, can be confirmed by
@@ -348,7 +359,7 @@ agreement with the veto - merely that th
 <p>If you disagree with a valid veto, you must lobby the person casting the 
 veto to withdraw his or her veto. If a veto is not withdrawn, the action 
 that has been vetoed must be reversed in a timely manner. </p>
-<h3 id="actions">Actions</h3>
+<h3 id="actions">Actions<a class="headerlink" href="#actions" title="Permanent link">&para;</a></h3>
 <p>This section describes the various actions which are undertaken within 
 the project, the corresponding approval required for that action and 
 those who have binding votes over the action. It also specifies the 

Modified: websites/staging/crunch/trunk/content/download.html
--- websites/staging/crunch/trunk/content/download.html (original)
+++ websites/staging/crunch/trunk/content/download.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
-                    <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+                    <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
-          <p>The Apache Crunch libraries are distributed under the <a href="">Apache License 2.0</a>.</p>
+          <style type="text/css">
+/* The following code is added by
+   It was originally lifted from */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>The Apache Crunch libraries are distributed under the <a href="">Apache License 2.0</a>.</p>
 <p>The link in the Download column takes you to a list of mirrors based on
 your location. Checksum and signature are located on Apache's main
 distribution site.</p>

Modified: websites/staging/crunch/trunk/content/future-work.html
--- websites/staging/crunch/trunk/content/future-work.html (original)
+++ websites/staging/crunch/trunk/content/future-work.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
-                    <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+                    <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
-          <p>This section contains an almost certainly incomplete list of known limitations and plans for future work.</p>
+          <style type="text/css">
+/* The following code is added by
+   It was originally lifted from */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>This section contains an almost certainly incomplete list of known limitations and plans for future work.</p>
 <li>We would like to have easy support for reading and writing data from/to the Hive metastore via the HCatalog

Modified: websites/staging/crunch/trunk/content/getting-started.html
--- websites/staging/crunch/trunk/content/getting-started.html (original)
+++ websites/staging/crunch/trunk/content/getting-started.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
-                    <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+                    <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,11 +145,22 @@
-          <p><em>Getting Started</em> will guide you through the process of creating a simple Crunch pipeline to count
+          <style type="text/css">
+/* The following code is added by
+   It was originally lifted from */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><em>Getting Started</em> will guide you through the process of creating a simple Crunch pipeline to count
 the words in a text document, which is the Hello World of distributed computing. Along the way,
 we'll explain the core Crunch concepts and how to use them to create effective and efficient data
-<h1 id="overview">Overview</h1>
+<h1 id="overview">Overview<a class="headerlink" href="#overview" title="Permanent link">&para;</a></h1>
 <p>The Apache Crunch project develops and supports Java APIs that simplify the process of creating data pipelines on top of Apache Hadoop. The
 Crunch APIs are modeled after <a href="">FlumeJava (PDF)</a>, which is the library that
 Google uses for building data pipelines on top of their own implementation of MapReduce.</p>
@@ -172,7 +183,7 @@ they represent their data, which makes C
 <a href="">geospatial</a> and
 <a href="">time series</a> data, and data stored in <a href="">Apache HBase</a> tables.</li>
-<h1 id="which-version-of-crunch-do-i-need">Which Version of Crunch Do I Need?</h1>
+<h1 id="which-version-of-crunch-do-i-need">Which Version of Crunch Do I Need?<a class="headerlink" href="#which-version-of-crunch-do-i-need" title="Permanent link">&para;</a></h1>
 <p>The core libraries are primarily developed against Hadoop 1.1.2, and are also tested against Hadoop 2.2.0.
 They should work with any version of Hadoop 1.x after 1.0.3 and any version of Hadoop 2.x after 2.0.0-alpha,
 although you should note that some of Hadoop 2.x's dependencies changed between 2.0.4-alpha and 2.2.0 (for example,
@@ -200,7 +211,7 @@ prior versions of crunch-hbase were deve
-<h2 id="maven-dependencies">Maven Dependencies</h2>
+<h2 id="maven-dependencies">Maven Dependencies<a class="headerlink" href="#maven-dependencies" title="Permanent link">&para;</a></h2>
 <p>The Crunch project provides Maven artifacts on Maven Central of the form:</p>
@@ -221,7 +232,7 @@ pipelines. Depending on your use case, y
 <li><code>crunch-examples</code>: Example MapReduce and HBase pipelines</li>
 <li><code>crunch-archetype</code>: A Maven archetype for creating new Crunch pipeline projects</li>
-<h2 id="building-from-source">Building From Source</h2>
+<h2 id="building-from-source">Building From Source<a class="headerlink" href="#building-from-source" title="Permanent link">&para;</a></h2>
 <p>You can download the most recently released Crunch libraries from the <a href="download.html">Download</a> page or from the Maven
 Central Repository.</p>
 <p>If you prefer, you can also build the Crunch libraries from the source code using Maven and install
@@ -241,7 +252,7 @@ it in your local repository:</p>
 AverageBytesByIP and TotalBytesByIP take as input a file in the Common Log Format (an example is provided in
 <code>crunch-examples/src/main/resources/access_logs.tar.gz</code>.) The WordAggregationHBase requires an Apache HBase cluster to be
 available, but creates tables and loads sample data as part of its run.</p>
-<h1 id="your-first-crunch-pipeline">Your First Crunch Pipeline</h1>
+<h1 id="your-first-crunch-pipeline">Your First Crunch Pipeline<a class="headerlink" href="#your-first-crunch-pipeline" title="Permanent link">&para;</a></h1>
 <p>There are a couple of ways to get started with Crunch. If you use Git, you can
 clone this project which contains an <a href="">example Crunch pipeline</a>:</p>
@@ -318,7 +329,7 @@ files, while <code>&lt;out&gt;</code> is
 Java applications or from unit tests. All required dependencies are on Maven's
 classpath so you can run the <code>WordCount</code> class directly without any additional
-<h2 id="walking-through-the-wordcount-example">Walking Through The WordCount Example</h2>
+<h2 id="walking-through-the-wordcount-example">Walking Through The WordCount Example<a class="headerlink" href="#walking-through-the-wordcount-example" title="Permanent link">&para;</a></h2>
 <p>Let's walk through the <code>run</code> method of the <code>WordCount</code> example line by line and explain the
 data processing concepts we encounter.</p>
 <p>Our WordCount application starts out with a <code>main</code> method that should be familiar to most

Modified: websites/staging/crunch/trunk/content/index.html
--- websites/staging/crunch/trunk/content/index.html (original)
+++ websites/staging/crunch/trunk/content/index.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
-                    <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+                    <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -147,7 +147,18 @@
-          <hr />
+          <style type="text/css">
+/* The following code is added by
+   It was originally lifted from */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<hr />
 <p>The <em>Apache Crunch</em> Java library provides a framework for writing, testing,
 and running MapReduce pipelines. Its goal is to make pipelines that are

Modified: websites/staging/crunch/trunk/content/mailing-lists.html
--- websites/staging/crunch/trunk/content/mailing-lists.html (original)
+++ websites/staging/crunch/trunk/content/mailing-lists.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
-                    <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+                    <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
-          <!--
+          <style type="text/css">
+/* The following code is added by
+   It was originally lifted from */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
   Markdown-generated tables don't have the proper CSS classes,
   so we use plain HTML tables.

Modified: websites/staging/crunch/trunk/content/pipelines.html
--- websites/staging/crunch/trunk/content/pipelines.html (original)
+++ websites/staging/crunch/trunk/content/pipelines.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
-                    <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+                    <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,11 +145,22 @@
-          <p>This section discusses the different steps of creating your own Crunch pipelines in more detail.</p>
-<h2 id="writing-a-dofn">Writing a DoFn</h2>
+          <style type="text/css">
+/* The following code is added by
+   It was originally lifted from */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>This section discusses the different steps of creating your own Crunch pipelines in more detail.</p>
+<h2 id="writing-a-dofn">Writing a DoFn<a class="headerlink" href="#writing-a-dofn" title="Permanent link">&para;</a></h2>
 <p>The DoFn class is designed to keep the complexity of the MapReduce APIs out of your way when you
 don't need them while still keeping them accessible when you do.</p>
-<h3 id="serialization">Serialization</h3>
+<h3 id="serialization">Serialization<a class="headerlink" href="#serialization" title="Permanent link">&para;</a></h3>
 <p>First, all DoFn instances are required to be <code></code>. This is a key aspect of the library's design:
 once a particular DoFn is assigned to the Map or Reduce stage of a MapReduce job, all of the state
 of that DoFn is serialized so that it may be distributed to all of the nodes in the Hadoop cluster that
@@ -163,15 +174,15 @@ will be running that task. There are two
 such as creating a non-serializable member variable, can be performed before processing begins. Similarly, all
 DoFn instances have a <code>cleanup</code> method that may be called after processing has finished to perform any required
 cleanup tasks.</p>
-<h3 id="scale-factor">Scale Factor</h3>
+<h3 id="scale-factor">Scale Factor<a class="headerlink" href="#scale-factor" title="Permanent link">&para;</a></h3>
 <p>The DoFn class defines a <code>scaleFactor</code> method that can be used to signal to the MapReduce compiler that a particular
 DoFn implementation will yield an output PCollection that is larger (scaleFactor &gt; 1) or smaller (0 &lt; scaleFactor &lt; 1)
 than the input PCollection it is applied to. The compiler may use this information to determine how to optimally
 split processing tasks between the Map and Reduce phases of dependent MapReduce jobs.</p>
-<h3 id="other-utilities">Other Utilities</h3>
+<h3 id="other-utilities">Other Utilities<a class="headerlink" href="#other-utilities" title="Permanent link">&para;</a></h3>
 <p>The DoFn base class provides convenience methods for accessing the <code>Configuration</code> and <code>Counter</code> objects that
 are associated with a MapReduce stage, so that they may be accessed during initialization, processing, and cleanup.</p>
-<h3 id="performing-cogroups-and-joins">Performing Cogroups and Joins</h3>
+<h3 id="performing-cogroups-and-joins">Performing Cogroups and Joins<a class="headerlink" href="#performing-cogroups-and-joins" title="Permanent link">&para;</a></h3>
 <p>Cogroups and joins are performed on PTable instances that have the same key type. This section walks through
 the basic flow of a cogroup operation, explaining how this higher-level operation is composed of the four primitive operations.
 In general, these common operations are provided as part of the core library or in extensions, you do not need

Modified: websites/staging/crunch/trunk/content/scrunch.html
--- websites/staging/crunch/trunk/content/scrunch.html (original)
+++ websites/staging/crunch/trunk/content/scrunch.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
-                    <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+                    <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -147,11 +147,22 @@
-          <h2 id="introduction">Introduction</h2>
+          <style type="text/css">
+/* The following code is added by
+   It was originally lifted from */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<h2 id="introduction">Introduction<a class="headerlink" href="#introduction" title="Permanent link">&para;</a></h2>
 <p>Scrunch is an experimental Scala wrapper for the Apache Crunch Java API, based on the same ideas as the
 <a href="">Cascade</a> project at Google, which created a Scala wrapper for
-<h2 id="why-scala">Why Scala?</h2>
+<h2 id="why-scala">Why Scala?<a class="headerlink" href="#why-scala" title="Permanent link">&para;</a></h2>
 <p>In many ways, Scala is the perfect language for writing MapReduce pipelines. Scala supports
 a mixture of functional and object-oriented programming styles and has powerful type-inference
 capabilities, allowing us to create complex pipelines using very few keystrokes. Here is
@@ -189,7 +200,7 @@ the second:</p>
-<h2 id="materializing-job-outputs">Materializing Job Outputs</h2>
+<h2 id="materializing-job-outputs">Materializing Job Outputs<a class="headerlink" href="#materializing-job-outputs" title="Permanent link">&para;</a></h2>
 <p>The Scrunch API also incorporates the Java library's <code>materialize</code> functionality, which allows us to easily read
 the output of a MapReduce pipeline into the client:</p>
 <div class="codehilite"><pre><span class="n">class</span> <span class="n">WordCountExample</span> <span class="p">{</span>
@@ -198,7 +209,7 @@ the output of a MapReduce pipeline into
-<h2 id="notes-and-thanks">Notes and Thanks</h2>
+<h2 id="notes-and-thanks">Notes and Thanks<a class="headerlink" href="#notes-and-thanks" title="Permanent link">&para;</a></h2>
 <p>Scrunch emerged out of conversations with <a href="!/squarecog">Dmitriy Ryaboy</a>,
 <a href="!/posco">Oscar Boykin</a>, and <a href="!/avibryant">Avi Bryant</a> from Twitter.
 Many thanks to them for their feedback, guidance, and encouragement. We are also grateful to

Modified: websites/staging/crunch/trunk/content/source-repository.html
--- websites/staging/crunch/trunk/content/source-repository.html (original)
+++ websites/staging/crunch/trunk/content/source-repository.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
-                    <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+                    <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
-          <p>The Apache Crunch Project uses <a href="">Git</a> for version control. Run the
+          <style type="text/css">
+/* The following code is added by
+   It was originally lifted from */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>The Apache Crunch Project uses <a href="">Git</a> for version control. Run the
 following command to clone the repository:</p>
 <div class="codehilite"><pre><span class="n">git</span> <span class="n">clone</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">git</span><span class="o">-</span><span class="n">wip</span><span class="o">-</span><span class="n">us</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">org</span><span class="o">/</span><span class="n">repos</span><span class="o">/</span><span class="n">asf</span><span class="o">/</span><span class="n">crunch</span><span class="p">.</span><span class="n">git</span>

Modified: websites/staging/crunch/trunk/content/user-guide.html
--- websites/staging/crunch/trunk/content/user-guide.html (original)
+++ websites/staging/crunch/trunk/content/user-guide.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
-                    <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+                    <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
-          <ol>
+          <style type="text/css">
+/* The following code is added by
+   It was originally lifted from */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
 <li><a href="#intro">Introduction to Crunch</a><ol>
 <li><a href="#motivation">Motivation</a></li>
 <li><a href="#datamodel">Data Model and Operators</a></li>
@@ -212,9 +223,9 @@
 <li><a href="#testing">Unit Testing Pipelines</a></li>
 <p><a name="intro"></a></p>
-<h2 id="introduction-to-crunch">Introduction to Crunch</h2>
+<h2 id="introduction-to-crunch">Introduction to Crunch<a class="headerlink" href="#introduction-to-crunch" title="Permanent link">&para;</a></h2>
 <p><a name="motivation"></a></p>
-<h3 id="motivation">Motivation</h3>
+<h3 id="motivation">Motivation<a class="headerlink" href="#motivation" title="Permanent link">&para;</a></h3>
 <p>Let's start with a basic question: why should you use <em>any</em> high-level tool for writing data pipelines, as opposed to developing against
 the MapReduce, Spark, or Tez APIs directly? Doesn't adding another layer of abstraction just increase the number of moving pieces you
 need to worry about, ala the <a href="">Law of Leaky Abstractions</a>?</p>
@@ -305,7 +316,7 @@ top of Apache Hadoop:</p>
 <p>In the next section, we'll give a quick overview of Crunch's version of these abstractions and how they relate to each other before going
 into more detail about their usage in the rest of the guide.</p>
 <p><a name="datamodel"></a></p>
-<h3 id="data-model-and-operators">Data Model and Operators</h3>
+<h3 id="data-model-and-operators">Data Model and Operators<a class="headerlink" href="#data-model-and-operators" title="Permanent link">&para;</a></h3>
 <p>Crunch's Java API is centered around three interfaces that represent distributed datasets: <a href="apidocs/0.10.0/org/apache/crunch/PCollection.html">PCollection<T></a>,
 <a href="">PTable<K, V></a>, and <a href="apidocs/0.10.0/org/apache/crunch/PGroupedTable.html">PGroupedTable<K, V></a>.</p>
 <p>A <code>PCollection&lt;T&gt;</code> represents a distributed, immutable collection of elements of type T. For example, we represent a text file as a
@@ -336,12 +347,12 @@ that are available for developers to use
 <li><a href="apidocs/0.10.0/org/apache/crunch/impl/spark/SparkPipeline.html">SparkPipeline</a>: Executes the pipeline by converting it to a series of Spark pipelines.</li>
 <p><a name="dataproc"></a></p>
-<h2 id="data-processing-with-dofns">Data Processing with DoFns</h2>
+<h2 id="data-processing-with-dofns">Data Processing with DoFns<a class="headerlink" href="#data-processing-with-dofns" title="Permanent link">&para;</a></h2>
 <p>DoFns represent the logical computations of your Crunch pipelines. They are designed to be easy to write, easy to test, and easy to deploy
 within the context of a MapReduce job. Much of your work with the Crunch APIs will be writing DoFns, and so having a good understanding of
 how to use them effectively is critical to crafting elegant and efficient pipelines.</p>
 <p><a name="dovsmap"></a></p>
-<h3 id="dofns-vs-mapper-and-reducer-classes">DoFns vs. Mapper and Reducer Classes</h3>
+<h3 id="dofns-vs-mapper-and-reducer-classes">DoFns vs. Mapper and Reducer Classes<a class="headerlink" href="#dofns-vs-mapper-and-reducer-classes" title="Permanent link">&para;</a></h3>
 <p>Let's see how DoFns compare to the Mapper and Reducer classes that you're used to writing when working with Hadoop's MapReduce API. When
 you're creating a MapReduce job, you start by declaring an instance of the <code>Job</code> class and using its methods to declare the implementations
 of the <code>Mapper</code> and <code>Reducer</code> classes that you want to use:</p>
@@ -431,7 +442,7 @@ of <code>static</code> methods on the cl
 regardless of whether the outer class is serializable or not. Using static methods to define your business logic in terms of a series of
 DoFns can also make your code easier to test by using in-memory PCollection implementations in your unit tests.</p>
 <p><a name="runproc"></a></p>
-<h3 id="runtime-processing-steps">Runtime Processing Steps</h3>
+<h3 id="runtime-processing-steps">Runtime Processing Steps<a class="headerlink" href="#runtime-processing-steps" title="Permanent link">&para;</a></h3>
 <p>After the Crunch runtime loads the serialized DoFns into its map and reduce tasks, the DoFns are executed on the input data via the following
@@ -450,7 +461,7 @@ be used to emit the sum of a list of num
 other cleanup task that is appropriate once the task has finished executing.</li>
 <p><a name="mrapis"></a></p>
-<h3 id="accessing-runtime-mapreduce-apis">Accessing Runtime MapReduce APIs</h3>
+<h3 id="accessing-runtime-mapreduce-apis">Accessing Runtime MapReduce APIs<a class="headerlink" href="#accessing-runtime-mapreduce-apis" title="Permanent link">&para;</a></h3>
 <p>DoFns provide direct access to the <code>TaskInputOutputContext</code> object that is used within a given Map or Reduce task via the <code>getContext</code>
 method. There are also a number of helper methods for working with the objects associated with the TaskInputOutputContext, including:</p>
@@ -475,7 +486,7 @@ objects returned by Crunch at the end of
 Counter classes directly in your Crunch pipelines (the two <code>getCounter</code> methods that were defined in DoFn are both deprecated) so that you will not be
 required to recompile your job jars when you move from a Hadoop 1.0 cluster to a Hadoop 2.0 cluster.)</p>
 <p><a name="doplan"></a></p>
-<h3 id="configuring-the-crunch-planner-and-mapreduce-jobs-with-dofns">Configuring the Crunch Planner and MapReduce Jobs with DoFns</h3>
+<h3 id="configuring-the-crunch-planner-and-mapreduce-jobs-with-dofns">Configuring the Crunch Planner and MapReduce Jobs with DoFns<a class="headerlink" href="#configuring-the-crunch-planner-and-mapreduce-jobs-with-dofns" title="Permanent link">&para;</a></h3>
 <p>Although most of the DoFn methods are focused on runtime execution, there are a handful of methods that are used during the planning phase
 before a pipeline is converted into MapReduce jobs. The first of these functions is <code>float scaleFactor()</code>, which should return a floating point
 value greater than 0.0f. You can override the scaleFactor method in your custom DoFns in order to provide a hint to the Crunch planner about
@@ -488,7 +499,7 @@ on the client before processing begins b
 will require extra memory settings to run, and so you could make sure that the value of the <code></code> argument had a large enough
 memory setting for the DoFn's needs before the job was launched on the cluster.</p>
 <p><a name="mapfn"></a></p>
-<h3 id="common-dofn-patterns">Common DoFn Patterns</h3>
+<h3 id="common-dofn-patterns">Common DoFn Patterns<a class="headerlink" href="#common-dofn-patterns" title="Permanent link">&para;</a></h3>
 <p>The Crunch APIs contain a number of useful subclasses of DoFn that handle common data processing scenarios and are easier
 to write and test. The top-level <a href="apidocs/0.10.0/org/apache/crunch/package-summary.html">org.apache.crunch</a> package contains three
 of the most important specializations, which we will discuss now. Each of these specialized DoFn implementations has associated methods
@@ -519,7 +530,7 @@ interface, which is defined right alongs
 interface defined via static factory methods in the <a href="apidocs/0.10.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a> class. We will discuss
 Aggregators more in the section on <a href="#aggregators">common MapReduce patterns</a>.</p>
 <p><a name="serde"></a></p>
-<h2 id="serializing-data-with-ptypes">Serializing Data with PTypes</h2>
+<h2 id="serializing-data-with-ptypes">Serializing Data with PTypes<a class="headerlink" href="#serializing-data-with-ptypes" title="Permanent link">&para;</a></h2>
 <p>Every <code>PCollection&lt;T&gt;</code> has an associated <code>PType&lt;T&gt;</code> that encapsulates the information on how to serialize and deserialize the contents of that
 PCollection. PTypes are necessary because of <a href="">type erasure</a>; at runtime, when
 the Crunch planner is mapping from PCollections to a series of MapReduce jobs, the type of a PCollection (that is, the <code>T</code> in <code>PCollection&lt;T&gt;</code>)
@@ -548,7 +559,7 @@ to mix-and-match PCollections that use d
 read in Writable data, do a shuffle using Avro, and then write the output data as Writables), but each PCollection's PType must belong to a single
 type family; for example, you cannot have a PTable whose key is serialized as a Writable and whose value is serialized as an Avro record.</p>
 <p><a name="corept"></a></p>
-<h3 id="core-ptypes">Core PTypes</h3>
+<h3 id="core-ptypes">Core PTypes<a class="headerlink" href="#core-ptypes" title="Permanent link">&para;</a></h3>
 <p>Both type families support a common set of primitive types (strings, longs, ints, floats, doubles, booleans, and bytes) as well as more complex
 PTypes that can be constructed out of other PTypes:</p>
@@ -673,7 +684,7 @@ for POJOs using Avro's reflection-based
 and easy to test, but the fact that the data is written out as Avro records means that you can use tools like Hive and Pig
 to query intermediate results to aid in debugging pipeline failures.</p>
 <p><a name="extendpt"></a></p>
-<h3 id="extending-ptypes">Extending PTypes</h3>
+<h3 id="extending-ptypes">Extending PTypes<a class="headerlink" href="#extending-ptypes" title="Permanent link">&para;</a></h3>
 <p>The simplest way to create a new <code>PType&lt;T&gt;</code> for a data object is to create a <em>derived</em> PType from one of the built-in PTypes from the Avro
 and Writable type families. If we have a base <code>PType&lt;S&gt;</code>, we can create a derived <code>PType&lt;T&gt;</code> by implementing an input <code>MapFn&lt;S, T&gt;</code> and an
 output <code>MapFn&lt;T, S&gt;</code> and then calling <code>PTypeFamily.derived(Class&lt;T&gt;, MapFn&lt;S, T&gt; in, MapFn&lt;T, S&gt; out, PType&lt;S&gt; base)</code>, which will return
@@ -695,7 +706,7 @@ easy to work with the POJO directly in y
 <p><a name="rwdata"></a></p>
-<h2 id="reading-and-writing-data">Reading and Writing Data</h2>
+<h2 id="reading-and-writing-data">Reading and Writing Data<a class="headerlink" href="#reading-and-writing-data" title="Permanent link">&para;</a></h2>
 <p>In the introduction to this user guide, we noted that all of the major tools for working with data pipelines on Hadoop include some sort of abstraction
 for working with the <code>InputFormat&lt;K, V&gt;</code> and <code>OutputFormat&lt;K, V&gt;</code> classes defined in the MapReduce APIs. For example, Hive includes
 SerDes, and Pig requires LoadFuncs and StoreFuncs. Let's take a moment to explain what functionality these abstractions provide for
@@ -719,13 +730,13 @@ to wrap an InputFormat and its associate
 job, even if those Sources have the same InputFormat. On the output side, the <code>Target</code> interface can be used in the same way to wrap a
 Hadoop <code>OutputFormat</code> and its associated key-value pairs in a way that can be isolated from any other outputs of a pipeline stage.</p>
 <p><a name="notethis"></a></p>
-<h3 id="a-note-on-sources-targets-and-hadoop-apis">A Note on Sources, Targets, and Hadoop APIs</h3>
+<h3 id="a-note-on-sources-targets-and-hadoop-apis">A Note on Sources, Targets, and Hadoop APIs<a class="headerlink" href="#a-note-on-sources-targets-and-hadoop-apis" title="Permanent link">&para;</a></h3>
 <p>Crunch, like Hive and Pig, is developed against the <a href="">org.apache.hadoop.mapreduce</a> API, not the older <a href="">org.apache.hadoop.mapred</a> API.
 This means that Crunch Sources and Targets expect subclasses of the new <a href="">InputFormat</a> and <a href="">OutputFormat</a> classes. These new
 classes are not 1:1 compatible with the InputFormat and OutputFormat classes associated with the <code>org.apache.hadoop.mapred</code> APIs, so please be
 aware of this difference when considering using existing InputFormats and OutputFormats with Crunch's Sources and Targets.</p>
 <p><a name="sources"></a></p>
-<h3 id="sources">Sources</h3>
+<h3 id="sources">Sources<a class="headerlink" href="#sources" title="Permanent link">&para;</a></h3>
 <p>Crunch defines both <code>Source&lt;T&gt;</code> and <code>TableSource&lt;K, V&gt;</code> interfaces that allow us to read an input as a <code>PCollection&lt;T&gt;</code> or a <code>PTable&lt;K, V&gt;</code>.
 You use a Source in conjunction with one of the <code>read</code> methods on the Pipeline interface:</p>
@@ -801,7 +812,7 @@ different files using the NLineInputForm
 <p><a name="targets"></a></p>
-<h3 id="targets">Targets</h3>
+<h3 id="targets">Targets<a class="headerlink" href="#targets" title="Permanent link">&para;</a></h3>
 <p>Crunch's <code>Target</code> interface is the analogue of <code>Source&lt;T&gt;</code> for OutputFormats. You create Targets for use with the <code>write</code> method
 defined on the <code>Pipeline</code> interface:</p>
@@ -873,7 +884,7 @@ parameters that this Target needs:</p>
 <p><a name="srctargets"></a></p>
-<h3 id="sourcetargets-and-write-modes">SourceTargets and Write Modes</h3>
+<h3 id="sourcetargets-and-write-modes">SourceTargets and Write Modes<a class="headerlink" href="#sourcetargets-and-write-modes" title="Permanent link">&para;</a></h3>
 <p>The <code>SourceTarget&lt;T&gt;</code> interface extends both the <code>Source&lt;T&gt;</code> and <code>Target</code> interfaces and allows a Path to act as both a
 Target for some PCollections as well as a Source for others. SourceTargets are convenient for any intermediate outputs within
 your pipeline. Just as we have the factory methods in the From and To classes for Sources and Targets, factory methods for
@@ -904,7 +915,7 @@ WriteModes for Crunch:</p>
 <p><a name="materialize"></a></p>
-<h3 id="materializing-data-into-the-client">Materializing Data Into the Client</h3>
+<h3 id="materializing-data-into-the-client">Materializing Data Into the Client<a class="headerlink" href="#materializing-data-into-the-client" title="Permanent link">&para;</a></h3>
 <p>In many analytical applications, we need to use the output of one phase of a data pipeline in order to configure subsequent pipeline
 stages. For example, many machine learning applications require that we iterate over a dataset until some convergence criterion is
 met. Crunch provides API methods that make it possible to materialize the data from a PCollection and stream the resulting data into
@@ -930,12 +941,12 @@ interface that has an associated <code>V
 of elements contained in that PCollection, but the pipeline tasks required to compute this value will not run until the <code>Long getValue()</code>
 method of the returned PObject is called.</p>
 <p><a name="patterns"></a></p>
-<h2 id="data-processing-patterns-in-crunch">Data Processing Patterns in Crunch</h2>
+<h2 id="data-processing-patterns-in-crunch">Data Processing Patterns in Crunch<a class="headerlink" href="#data-processing-patterns-in-crunch" title="Permanent link">&para;</a></h2>
 <p>This section describes the various data processing patterns implemented in Crunch's library APIs,
 which are in the <a href="apidocs/0.10.0/org/apache/crunch/lib/package-summary.html">org.apache.crunch.lib</a>
 <p><a name="gbk"></a></p>
-<h3 id="groupbykey">groupByKey</h3>
+<h3 id="groupbykey">groupByKey<a class="headerlink" href="#groupbykey" title="Permanent link">&para;</a></h3>
 <p>Most of the data processing patterns described in this section rely on PTable's groupByKey method,
 which controls how data is shuffled and aggregated by the underlying execution engine. The groupByKey
 method has three flavors on the PTable interface:</p>
@@ -973,7 +984,7 @@ same classes may also be used with other
 options specified that will only be applied to the job that actually executes that phase of the data
 <p><a name="aggregators"></a></p>
-<h3 id="combinevalues">combineValues</h3>
+<h3 id="combinevalues">combineValues<a class="headerlink" href="#combinevalues" title="Permanent link">&para;</a></h3>
 <p>Calling one of the groupByKey methods on PTable returns an instance of the PGroupedTable interface.
 PGroupedTable provides a <code>combineValues</code> method that can be used to signal to the planner that we want to perform
 associative aggregations on our data both before and after the shuffle.</p>
@@ -1018,7 +1029,7 @@ the average of a set of values:</p>
 <p><a name="simpleagg"></a></p>
-<h3 id="simple-aggregations">Simple Aggregations</h3>
+<h3 id="simple-aggregations">Simple Aggregations<a class="headerlink" href="#simple-aggregations" title="Permanent link">&para;</a></h3>
 <p>Many of the most common aggregation patterns in Crunch are provided as methods on the PCollection
 interface, including <code>count</code>, <code>max</code>, <code>min</code>, and <code>length</code>. The implementations of these methods,
 however, are in the <a href="apidocs/0.10.0/org/apache/crunch/lib/Aggregate.html">Aggregate</a> library class.
@@ -1040,7 +1051,7 @@ most frequently occuring elements, you w
 <p><a name="joins"></a></p>
-<h3 id="joining-data">Joining Data</h3>
+<h3 id="joining-data">Joining Data<a class="headerlink" href="#joining-data" title="Permanent link">&para;</a></h3>
 <p>Joins in Crunch are based on equal-valued keys in different PTables. Joins have also evolved
 a great deal in Crunch over the lifetime of the project. The <a href="apidocs/0.10.0/org/apache/crunch/lib/Join.html">Join</a>
 API provides simple methods for performing equijoins, left joins, right joins, and full joins, but modern
@@ -1064,14 +1075,14 @@ a given key in a PCollection, so joining
 surprising results. Using a non-null dummy value in your PCollections is a good idea in
 <p><a name="reducejoin"></a></p>
-<h4 id="reduce-side-joins">Reduce-side Joins</h4>
+<h4 id="reduce-side-joins">Reduce-side Joins<a class="headerlink" href="#reduce-side-joins" title="Permanent link">&para;</a></h4>
 <p>Reduce-side joins are handled by the <a href="apidocs/0.10.0/org/apache/crunch/lib/join/DefaultJoinStrategy.html">DefaultJoinStrategy</a>.
 Reduce-side joins are the simplest and most robust kind of joins in Hadoop; the keys from the two inputs are
 shuffled together to the reducers, where the values from the smaller of the two collections are collected and then
 streamed over the values from the larger of the two collections. You can control the number of reducers that is used
 to perform the join by passing an integer argument to the DefaultJoinStrategy constructor.</p>
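The buffer-then-stream behaviour described above can be sketched in plain Java, without any Crunch or Hadoop dependency. This is an in-memory illustration only, not the DefaultJoinStrategy implementation; the class name `ReduceSideJoinSketch` and the helpers `kv` and `innerJoin` are hypothetical names introduced for this example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the Crunch API.
public class ReduceSideJoinSketch {

    // Small helper to build a key-value pair.
    public static Map.Entry<String, String> kv(String key, String value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Inner join: buffer the smaller input by key, then stream the larger input past it.
    public static List<String> innerJoin(List<Map.Entry<String, String>> smaller,
                                         List<Map.Entry<String, String>> larger) {
        // Stands in for the shuffle: collect the smaller side's values per key.
        Map<String, List<String>> buffered = new HashMap<>();
        for (Map.Entry<String, String> e : smaller) {
            buffered.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        // Stands in for the reducer: stream the larger side over the buffered values.
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, String> e : larger) {
            List<String> matches = buffered.get(e.getKey());
            if (matches == null) {
                continue; // no partner on the smaller side, so nothing to emit
            }
            for (String small : matches) {
                joined.add(e.getKey() + ":" + small + "," + e.getValue());
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> out = innerJoin(
            Arrays.asList(kv("a", "x"), kv("b", "y")),
            Arrays.asList(kv("a", "1"), kv("c", "2"), kv("a", "3")));
        System.out.println(out);
    }
}
```

Only the smaller side is held in memory per key, which is why the choice of which collection is buffered matters for memory use.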
 <p><a name="mapjoin"></a></p>
-<h4 id="map-side-joins">Map-side Joins</h4>
+<h4 id="map-side-joins">Map-side Joins<a class="headerlink" href="#map-side-joins" title="Permanent link">&para;</a></h4>
 <p>Map-side joins are handled by the <a href="apidocs/0.10.0/org/apache/crunch/lib/join/MapsideJoinStrategy.html">MapsideJoinStrategy</a>.
 Map-side joins require that the smaller of the two input tables is loaded into memory on the tasks on the cluster, so
 there is a requirement that at least one of the tables be relatively small so that it can comfortably fit into memory within
@@ -1084,7 +1095,7 @@ recommend that you use the <code>Mapside
 implementation of the MapsideJoinStrategy in which the left-side PTable is loaded into
 memory instead of the right-side PTable.</p>
 <p><a name="shardedjoin"></a></p>
-<h4 id="sharded-joins">Sharded Joins</h4>
+<h4 id="sharded-joins">Sharded Joins<a class="headerlink" href="#sharded-joins" title="Permanent link">&para;</a></h4>
 <p>Many distributed joins have skewed data that can cause regular reduce-side joins to fail due to out-of-memory issues on
 the partitions that happen to contain the keys with highest cardinality. To handle these skew issues, Crunch has the
 <a href="apidocs/0.10.0/org/apache/crunch/lib/join/ShardedJoinStrategy.html">ShardedJoinStrategy</a> that allows developers to shard
@@ -1092,7 +1103,7 @@ each key to multiple reducers, which pre
 in exchange for sending more data over the wire. For problems with significant skew issues, the ShardedJoinStrategy can
 significantly improve performance.</p>
 <p><a name="bloomjoin"></a></p>
-<h4 id="bloom-filter-joins">Bloom Filter Joins</h4>
+<h4 id="bloom-filter-joins">Bloom Filter Joins<a class="headerlink" href="#bloom-filter-joins" title="Permanent link">&para;</a></h4>
 <p>Last but not least, the <a href="apidocs/0.10.0/org/apache/crunch/lib/join/BloomFilterJoinStrategy.html">BloomFilterJoinStrategy</a> builds
 a <a href="">bloom filter</a> on the left-hand side table that is used to filter the contents
 of the right-hand side table to eliminate entries from the (larger) right-hand side table that have no hope of being joined
@@ -1100,7 +1111,7 @@ to values in the left-hand side table. T
 into memory on the tasks of the job, but is still significantly smaller than the right-hand side table, and we know that the
 vast majority of the keys in the right-hand side table will not match the keys in the left-hand side of the table.</p>
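The pre-filtering idea can be sketched in plain Java: build a bloom filter over the left-hand keys, then drop right-hand keys that the filter rules out before any join is attempted. This is a minimal illustration, not the BloomFilterJoinStrategy implementation; the class name, the filter size, the number of hash functions, and the hash mixing are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration class; not part of the Crunch API.
public class BloomFilterJoinSketch {
    private static final int BITS = 1 << 16; // filter size, chosen arbitrarily
    private static final int HASHES = 3;     // number of hash functions, chosen arbitrarily

    // Derives the i-th bit position for a key by mixing its hashCode with a seed.
    private static int bucket(String key, int seed) {
        int h = key.hashCode() + seed * 0x9E3779B9;
        h ^= (h >>> 16);
        h *= 0x85EBCA6B;
        h ^= (h >>> 13);
        return Math.floorMod(h, BITS);
    }

    // Builds the filter from the keys of the (smaller) left-hand side.
    public static BitSet build(Iterable<String> leftKeys) {
        BitSet filter = new BitSet(BITS);
        for (String key : leftKeys) {
            for (int i = 0; i < HASHES; i++) {
                filter.set(bucket(key, i));
            }
        }
        return filter;
    }

    // False means "definitely absent"; true means "possibly present".
    public static boolean mightContain(BitSet filter, String key) {
        for (int i = 0; i < HASHES; i++) {
            if (!filter.get(bucket(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Keeps only right-hand keys that could still match a left-hand key.
    public static List<String> prefilter(BitSet filter, List<String> rightKeys) {
        List<String> survivors = new ArrayList<>();
        for (String key : rightKeys) {
            if (mightContain(filter, key)) {
                survivors.add(key);
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        BitSet filter = build(Arrays.asList("a", "b"));
        System.out.println(prefilter(filter, Arrays.asList("a", "zzz", "b")));
    }
}
```

A bloom filter never produces false negatives, so every key that really occurs on the left survives the pre-filter. False positives are possible, which is why an exact join still has to run on the survivors.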
 <p><a name="cogroups"></a></p>
-<h4 id="cogroups">Cogroups</h4>
+<h4 id="cogroups">Cogroups<a class="headerlink" href="#cogroups" title="Permanent link">&para;</a></h4>
 <p>Some kinds of joins are richer and more complex than the typical kind of relational join that is handled by JoinStrategy.
 For example, we might want to join two datasets
 together and only emit a record if each of the sets had at least two distinct values associated
@@ -1125,12 +1136,12 @@ PTable whose values are made up of Colle
 how they work, you can consult the <a href="">section on cogroups</a>
 in the Apache Pig book.</p>
 <p><a name="sorting"></a></p>
-<h3 id="sorting">Sorting</h3>
+<h3 id="sorting">Sorting<a class="headerlink" href="#sorting" title="Permanent link">&para;</a></h3>
 <p>After joins and cogroups, sorting data is the most common distributed computing pattern. The
 Crunch APIs have a number of utilities for performing fully distributed sorts as well as
 more advanced patterns like secondary sorts.</p>
 <p><a name="stdsort"></a></p>
-<h4 id="standard-and-reverse-sorting">Standard and Reverse Sorting</h4>
+<h4 id="standard-and-reverse-sorting">Standard and Reverse Sorting<a class="headerlink" href="#standard-and-reverse-sorting" title="Permanent link">&para;</a></h4>
 <p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/Sort.html">Sort</a> API methods contain utility functions
 for sorting the contents of PCollections and PTables whose contents implement the <code>Comparable</code>
 interface. By default, MapReduce does not perform total sorts on its keys during a shuffle; instead
@@ -1160,7 +1171,7 @@ the <a href="apidocs/0.10.0/org/apache/c
 <p><a name="secsort"></a></p>
-<h4 id="secondary-sorts">Secondary Sorts</h4>
+<h4 id="secondary-sorts">Secondary Sorts<a class="headerlink" href="#secondary-sorts" title="Permanent link">&para;</a></h4>
 <p>Another pattern that occurs frequently in distributed processing is <em>secondary sorts</em>, where we
 want to group a set of records by one key and sort the records within each group by a second key.
 The <a href="apidocs/0.10.0/org/apache/crunch/lib/SecondarySort.html">SecondarySort</a> API provides a set
@@ -1169,11 +1180,11 @@ where <code>K</code> is the primary grou
 method will perform the grouping and sorting and will then apply a given DoFn to process the
 grouped and sorted values.</p>
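The group-then-sort-within-group pattern can be sketched in memory with plain Java. This only illustrates the result a secondary sort produces; it is not the SecondarySort API, which performs the ordering during the shuffle. The class name and the `rec` and `groupAndSort` helpers are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of the Crunch API.
public class SecondarySortSketch {

    // Builds a record of the form (groupKey, (sortKey, payload)).
    public static Map.Entry<String, Map.Entry<Integer, String>> rec(
            String groupKey, int sortKey, String payload) {
        Map.Entry<Integer, String> value = new AbstractMap.SimpleEntry<>(sortKey, payload);
        return new AbstractMap.SimpleEntry<>(groupKey, value);
    }

    // Groups records by the primary key, then orders each group by the secondary key.
    public static Map<String, List<String>> groupAndSort(
            List<Map.Entry<String, Map.Entry<Integer, String>>> records) {
        Map<String, List<Map.Entry<Integer, String>>> groups = new TreeMap<>();
        for (Map.Entry<String, Map.Entry<Integer, String>> r : records) {
            groups.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<Map.Entry<Integer, String>>> g : groups.entrySet()) {
            // Sort within the group by the secondary key only.
            g.getValue().sort(Map.Entry.comparingByKey());
            List<String> ordered = new ArrayList<>();
            for (Map.Entry<Integer, String> v : g.getValue()) {
                ordered.add(v.getValue());
            }
            result.put(g.getKey(), ordered);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(groupAndSort(Arrays.asList(
            rec("u1", 3, "c"), rec("u2", 1, "x"), rec("u1", 1, "a"), rec("u1", 2, "b"))));
    }
}
```

In a distributed setting each group may be too large to sort in memory, which is the reason the framework orders values during the shuffle rather than inside the processing function.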
 <p><a name="otheropts"></a></p>
-<h3 id="other-operations">Other Operations</h3>
+<h3 id="other-operations">Other Operations<a class="headerlink" href="#other-operations" title="Permanent link">&para;</a></h3>
 <p>Crunch provides implementations of a number of other common distributed processing patterns and
 techniques throughout its library APIs.</p>
 <p><a name="cartesian"></a></p>
-<h4 id="cartesian-products">Cartesian Products</h4>
+<h4 id="cartesian-products">Cartesian Products<a class="headerlink" href="#cartesian-products" title="Permanent link">&para;</a></h4>
 <p>Cartesian products between PCollections are a bit tricky in distributed processing; we usually want
 one of the datasets to be small enough to fit into memory, and then do a pass over the larger data
 set where we emit an element of the smaller data set along with each element from the larger set.</p>
@@ -1183,7 +1194,7 @@ provides methods for a reduce-side full
 this is a pretty expensive operation, and you should go out of your way to avoid these kinds of processing
 steps in your pipelines.</p>
 <p><a name="shard"></a></p>
-<h4 id="coalescing">Coalescing</h4>
+<h4 id="coalescing">Coalescing<a class="headerlink" href="#coalescing" title="Permanent link">&para;</a></h4>
 <p>Many MapReduce jobs have the potential to generate a large number of small files that could be used more
 effectively by clients if they were all merged together into a small number of large files. The
 <a href="apidocs/0.10.0/org/apache/crunch/lib/Shard.html">Shard</a> API provides a single method, <code>shard</code>, that allows
@@ -1196,7 +1207,7 @@ you to coalesce a given PCollection into
 <p>This has the effect of running a no-op MapReduce job that shuffles the data into the given number of
 partitions. This is often a useful step at the end of a long pipeline run.</p>
 <p><a name="distinct"></a></p>
-<h4 id="distinct">Distinct</h4>
+<h4 id="distinct">Distinct<a class="headerlink" href="#distinct" title="Permanent link">&para;</a></h4>
 <p>Crunch's <a href="apidocs/0.10.0/org/apache/crunch/lib/Distinct.html">Distinct</a> API has a method, <code>distinct</code>, that
 returns one copy of each unique element in a given PCollection:</p>
@@ -1218,7 +1229,7 @@ with another method in Distinct:</p>
 value for your own pipelines. The optimal value will depend on some combination of the size of the objects (and
 thus the amount of memory they consume) and the number of unique elements in the data.</p>
 <p><a name="sampling"></a></p>
-<h4 id="sampling">Sampling</h4>
+<h4 id="sampling">Sampling<a class="headerlink" href="#sampling" title="Permanent link">&para;</a></h4>
 <p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/Sample.html">Sample</a> API provides methods for two sorts of PCollection
 sampling: random and reservoir.</p>
 <p>Random sampling is where you include each record in the sample with a fixed probability, and is probably what you're
@@ -1244,11 +1255,11 @@ collection! You can read more about how
 random number generators. Note that all of the sampling algorithms Crunch provides, both random and reservoir,
 only require a single pass over the data.</p>
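To show why reservoir sampling needs only one pass, here is a minimal plain-Java sketch of the classic unweighted "Algorithm R". It is an illustration of the general technique and makes no claim about how the Sample API is implemented; the class name `ReservoirSketch` and its `sample` method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class; not part of the Crunch API.
public class ReservoirSketch {

    // Returns up to k items chosen uniformly at random from a stream of unknown length.
    public static <T> List<T> sample(Iterator<T> items, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (items.hasNext()) {
            T item = items.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill phase: the first k items are always kept.
                reservoir.add(item);
            } else {
                // Replace phase: pick a slot uniformly in [0, seen); keep the item
                // only if the slot falls inside the reservoir (probability k / seen).
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(sample(data.iterator(), 3, new Random(42L)));
    }
}
```

The stream is consumed exactly once and only k items are held at any time, so the total number of elements never has to be known in advance.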
 <p><a name="sets"></a></p>
-<h4 id="set-operations">Set Operations</h4>
+<h4 id="set-operations">Set Operations<a class="headerlink" href="#set-operations" title="Permanent link">&para;</a></h4>
 <p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/Set.html">Set</a> API methods complement Crunch's built-in <code>union</code> methods and
 provide support for finding the intersection, the difference, or the <a href="">comm</a> of two PCollections.</p>
 <p><a name="splits"></a></p>
-<h4 id="splits">Splits</h4>
+<h4 id="splits">Splits<a class="headerlink" href="#splits" title="Permanent link">&para;</a></h4>
 <p>Sometimes, you want to write two different outputs from the same DoFn into different PCollections. An example of this would
 be a pipeline in which you wanted to write good records to one file and bad or corrupted records to a different file for
 further examination. The <a href="apidocs/0.10.0/org/apache/crunch/lib/Channels.html">Channels</a> class provides a method that allows
@@ -1261,7 +1272,7 @@ you to split an input PCollection of Pai
 <p><a name="objectreuse"></a></p>
-<h3 id="retaining-objects-within-dofns">Retaining objects within DoFns</h3>
+<h3 id="retaining-objects-within-dofns">Retaining objects within DoFns<a class="headerlink" href="#retaining-objects-within-dofns" title="Permanent link">&para;</a></h3>
 <p>For reasons of efficiency, Hadoop MapReduce repeatedly passes the <a href="">same references as keys and values to Mappers and Reducers</a> instead of passing in new objects for each call. 
 The state of the singleton key and value objects is updated between each call 
 to <code></code> and <code>Reducer.reduce()</code>, as well as updating it between each 
@@ -1316,7 +1327,7 @@ the maximum value encountered would be i
 <p><a name="hbase"></a></p>
-<h2 id="crunch-for-hbase">Crunch for HBase</h2>
+<h2 id="crunch-for-hbase">Crunch for HBase<a class="headerlink" href="#crunch-for-hbase" title="Permanent link">&para;</a></h2>
 <p>Crunch is an excellent platform for creating pipelines that involve processing data from HBase tables. Because of Crunch's
 flexible schemas for PCollections and PTables, you can write pipelines that operate directly on HBase API classes like
 <code>Put</code>, <code>KeyValue</code>, and <code>Result</code>.</p>
@@ -1334,7 +1345,7 @@ hfiles directly, which is much faster th
 into HBase tables. See the utility methods in the <a href="apidocs/0.10.0/org/apache/crunch/io/hbase/HFileUtils.html">HFileUtils</a> class for
 more details on how to work with PCollections against hfiles.</p>
 <p><a name="exec"></a></p>
-<h2 id="managing-pipeline-execution">Managing Pipeline Execution</h2>
+<h2 id="managing-pipeline-execution">Managing Pipeline Execution<a class="headerlink" href="#managing-pipeline-execution" title="Permanent link">&para;</a></h2>
 <p>Crunch uses a lazy execution model. No jobs are run or outputs created until the user explicitly invokes one of the methods on the
 Pipeline interface that controls job planning and execution. The simplest of these methods is the <code>PipelineResult run()</code> method,
 which analyzes the current graph of PCollections and Target outputs and comes up with a plan to ensure that each of the outputs is
@@ -1356,11 +1367,11 @@ If the planner detects a materialized or
 PCollection to its own choice. The implementations of materialize and cache vary slightly between the MapReduce-based and Spark-based
 execution pipelines in a way that is explained in the subsequent section of the guide.</p>
 <p><a name="pipelines"></a></p>
-<h2 id="the-different-pipeline-implementations-properties-and-configuration-options">The Different Pipeline Implementations (Properties and Configuration options)</h2>
+<h2 id="the-different-pipeline-implementations-properties-and-configuration-options">The Different Pipeline Implementations (Properties and Configuration options)<a class="headerlink" href="#the-different-pipeline-implementations-properties-and-configuration-options" title="Permanent link">&para;</a></h2>
 <p>This section adds some additional details about the implementation and configuration options available for each of
 the different execution engines.</p>
 <p><a name="mrpipeline"></a></p>
-<h3 id="mrpipeline">MRPipeline</h3>
+<h3 id="mrpipeline">MRPipeline<a class="headerlink" href="#mrpipeline" title="Permanent link">&para;</a></h3>
 <p>The <a href="apidocs/0.10.0/org/apache/crunch/impl/mr/MRPipeline.html">MRPipeline</a> is the oldest implementation of the Pipeline interface and
 compiles and executes the DAG of PCollections into a series of MapReduce jobs. MRPipeline has three constructors that are commonly
@@ -1420,7 +1431,7 @@ aware of:</p>
 <p><a name="sparkpipeline"></a></p>
-<h3 id="sparkpipeline">SparkPipeline</h3>
+<h3 id="sparkpipeline">SparkPipeline<a class="headerlink" href="#sparkpipeline" title="Permanent link">&para;</a></h3>
 <p>The <code>SparkPipeline</code> is the newest implementation of the Pipeline interface, and was added in Crunch 0.10.0. It has two default constructors:</p>
 <li><code>SparkPipeline(String sparkConnection, String appName)</code> which takes a Spark connection string, which is of the form <code>local[numThreads]</code> for
@@ -1446,7 +1457,7 @@ get strange and unpredictable failures i
 be a little rough around the edges and may not handle all of the use cases that MRPipeline can handle, although the Crunch community is
 actively working to ensure complete compatibility between the two implementations.</p>
 <p><a name="mempipeline"></a></p>
-<h3 id="mempipeline">MemPipeline</h3>
+<h3 id="mempipeline">MemPipeline<a class="headerlink" href="#mempipeline" title="Permanent link">&para;</a></h3>
 <p>The <a href="apidocs/0.10.0/org/apache/crunch/impl/mem/MemPipeline.html">MemPipeline</a> implementation of Pipeline has a few interesting
 properties. First, unlike MRPipeline, MemPipeline is a singleton; you don't create a MemPipeline, you just get a reference to it
 via the static <code>MemPipeline.getInstance()</code> method. Second, all of the operations in the MemPipeline are executed completely in-memory,
@@ -1479,10 +1490,10 @@ on the read side. Often the best way to
 <code>materialize()</code> method to get a reference to the contents of the in-memory collection and then verify them directly,
 without writing them out to disk.</p>
 <p><a name="testing"></a></p>
-<h2 id="unit-testing-pipelines">Unit Testing Pipelines</h2>
+<h2 id="unit-testing-pipelines">Unit Testing Pipelines<a class="headerlink" href="#unit-testing-pipelines" title="Permanent link">&para;</a></h2>
 <p>For production data pipelines, unit tests are an absolute must. The <a href="#mempipeline">MemPipeline</a> implementation of the Pipeline
 interface has several tools to help developers create effective unit tests, which will be detailed in this section.</p>
-<h3 id="unit-testing-dofns">Unit Testing DoFns</h3>
+<h3 id="unit-testing-dofns">Unit Testing DoFns<a class="headerlink" href="#unit-testing-dofns" title="Permanent link">&para;</a></h3>
 <p>Many of the DoFn implementations, such as <code>MapFn</code> and <code>FilterFn</code>, are very easy to test, since they accept a single input
 and return a single output. For general purpose DoFns, we need an instance of the <a href="apidocs/0.10.0/org/apache/crunch/Emitter.html">Emitter</a>
 interface that we can pass to the DoFn's <code>process</code> method and then read in the values that are written by the function. Support
@@ -1497,7 +1508,7 @@ has a <code>List&lt;T&gt; getOutput()</c
-<h3 id="testing-complex-dofns-and-pipelines">Testing Complex DoFns and Pipelines</h3>
+<h3 id="testing-complex-dofns-and-pipelines">Testing Complex DoFns and Pipelines<a class="headerlink" href="#testing-complex-dofns-and-pipelines" title="Permanent link">&para;</a></h3>
 <p>Many of the DoFns we write involve more complex processing that requires that our DoFn be initialized and cleaned up, or that
 define Counters that we use to track the inputs that we receive. In order to ensure that our DoFns are working properly across
 their entire lifecycle, it's best to use the <a href="#mempipeline">MemPipeline</a> implementation to create in-memory instances of
@@ -1532,7 +1543,7 @@ those Counters between test runs by call
-<h3 id="designing-testable-data-pipelines">Designing Testable Data Pipelines</h3>
+<h3 id="designing-testable-data-pipelines">Designing Testable Data Pipelines<a class="headerlink" href="#designing-testable-data-pipelines" title="Permanent link">&para;</a></h3>
 <p>In the same way that we try to <a href="">write testable code</a>, we want to ensure that
 our data pipelines are written in a way that makes them easy to test. In general, you should try to break up complex pipelines
 into a number of function calls that perform a small set of operations on input PCollections and return one or more PCollections
@@ -1576,7 +1587,7 @@ is taken from one of Crunch's integratio
 computations that combine custom DoFns with Crunch's built-in <code>cogroup</code> operation by using the <a href="#mempipeline">MemPipeline</a>
 implementation to create test data sets that we can easily verify by hand, and then this same logic can be executed on
 a distributed data set using either the <a href="#mrpipeline">MRPipeline</a> or <a href="#sparkpipeline">SparkPipeline</a> implementations.</p>
-<h3 id="pipeline-execution-plan-visualizations">Pipeline execution plan visualizations</h3>
+<h3 id="pipeline-execution-plan-visualizations">Pipeline execution plan visualizations<a class="headerlink" href="#pipeline-execution-plan-visualizations" title="Permanent link">&para;</a></h3>
 <p>Crunch provides tools to visualize the pipeline execution plan. The <code>String getPlanDotFile()</code> method of <a href="apidocs/0.10.0/org/apache/crunch/PipelineExecution.html">PipelineExecution</a> returns a DOT-format visualization of the execution plan. Furthermore, if the output folder is set, Crunch will save the dotfile diagram on each pipeline execution:</p>
 <div class="codehilite"><pre>    <span class="n">Configuration</span> <span class="n">conf</span> <span class="p">=...;</span>     
     <span class="n">String</span> <span class="n">dotfileDir</span> <span class="p">=...;</span>
