crunch-commits mailing list archives

From build...@apache.org
Subject svn commit: r887976 - in /websites/staging/crunch/trunk/content: ./ intro.html
Date Mon, 25 Nov 2013 05:21:07 GMT
Author: buildbot
Date: Mon Nov 25 05:21:07 2013
New Revision: 887976

Log:
Staging update by buildbot for crunch

Modified:
    websites/staging/crunch/trunk/content/   (props changed)
    websites/staging/crunch/trunk/content/intro.html

Propchange: websites/staging/crunch/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon Nov 25 05:21:07 2013
@@ -1 +1 @@
-1544354
+1545153

Modified: websites/staging/crunch/trunk/content/intro.html
==============================================================================
--- websites/staging/crunch/trunk/content/intro.html (original)
+++ websites/staging/crunch/trunk/content/intro.html Mon Nov 25 05:21:07 2013
@@ -250,7 +250,7 @@ return a type that has an associated obj
 supports two serialization frameworks, called <em>type families</em>: one based
on Hadoop's <code>Writable</code> interface, and another based on <code>Apache
Avro</code>.
 You can read more about how to work with Crunch's serialization libraries here. TODO</p>
 <p>Because all of the core logic in our application is exposed via a single static
method that operates on Crunch interfaces, we can use Crunch's
-in-memory API to test our business logic using a unit testing framework like JUnit. Let's
look at an exampel unit test for the word count
+in-memory API to test our business logic using a unit testing framework like JUnit. Let's
look at an example unit test for the word count
 application:</p>
 <div class="codehilite"><pre><span class="n">package</span> <span
class="n">org</span><span class="p">.</span><span class="n">myorg</span><span
class="p">;</span>
 
@@ -283,51 +283,55 @@ Collections Classes like <code>java.util
 pipeline into the client and make decisions based on that data allows us to create sophisticated
analytical
 applications that can modify their downstream processing based on the results of upstream
computations.</p>
 <h3 id="data-model-and-operators">Data Model and Operators</h3>
-<p>The Java API is centered around three interfaces that represent distributed datasets:
<code>PCollection&lt;T&gt;</code>, <code>PTable&lt;K, V&gt;</code>,
and <code>PGroupedTable&lt;K, V&gt;</code>.</p>
+<p>The Java API is centered around three interfaces that represent distributed datasets:
<a href="apidocs/current/org/apache/crunch/PCollection.html">PCollection<T></a>,
+<a href="http://crunch.apache.org/apidocs/current/org/apache/crunch/PTable.html">PTable<K,
V></a>, and <a href="apidocs/current/org/apache/crunch/PGroupedTable.html">PGroupedTable<K,
V></a>.</p>
 <p>A <code>PCollection&lt;T&gt;</code> represents a distributed,
unordered collection of elements of type T. For example, we represent a text file as a
-<code>PCollection&lt;String&gt;</code> object. PCollection provides a
method, <code>parallelDo</code>, that applies a <code>DoFn</code>
to each element in a PCollection in parallel,
-and returns a new PCollection as its result. </p>
-<p>A <code>PTable&lt;K, V&gt;</code> is a sub-interface of PCollection
that represents a distributed, unordered multimap of its key type K to its value type V.
+<code>PCollection&lt;String&gt;</code> object. <code>PCollection&lt;T&gt;</code>
provides a method, <code>parallelDo</code>, that applies a <a href="apidocs/current/org/apache/crunch/DoFn.html">DoFn<T,
U></a>
+to each element in the <code>PCollection&lt;T&gt;</code> in parallel,
and returns a new <code>PCollection&lt;U&gt;</code> as its result.</p>
+<p>A <code>PTable&lt;K, V&gt;</code> is a sub-interface of <code>PCollection&lt;Pair&lt;K,
V&gt;&gt;</code> that represents a distributed, unordered multimap of its key
type K to its value type V.
 In addition to the parallelDo operation, PTable provides a <code>groupByKey</code>
operation that aggregates all of the values in the PTable that
-have the same key into a single record. It is the groupByKey operation that triggers the
sort phase of a MapReduce job.</p>
-<p>The result of a groupByKey operation is a <code>PGroupedTable&lt;K, V&gt;</code>
object, which is a distributed, sorted map of keys of type K to an Iterable
-collection of values of type V. In addition to parallelDo, the PGroupedTable provides a <code>combineValues</code>
operation, which allows for
-a commutative and associative aggregation operator to be applied to the values of the PGroupedTable
instance on both the map side and the
-reduce side of a MapReduce job.</p>
+have the same key into a single record. It is the groupByKey operation that triggers the
sort phase of a MapReduce job. Developers can exercise
+fine-grained control over the number of reducers and the partitioning, grouping, and sorting
strategies used during the shuffle by providing an instance
+of the <a href="apidocs/current/org/apache/crunch/GroupingOptions.html">GroupingOptions</a>
class to the <code>groupByKey</code> function.</p>
+<p>The result of a groupByKey operation is a <code>PGroupedTable&lt;K, V&gt;</code>
object, which is a distributed, sorted map of keys of type K to an Iterable<V> that
may
+be iterated over exactly once. In addition to <code>parallelDo</code> processing
via DoFns, PGroupedTable provides a <code>combineValues</code> operation that
allows a
+commutative and associative <a href="apidocs/current/org/apache/crunch/Aggregator.html">Aggregator<V></a>
to be applied to the values of the PGroupedTable
+instance on both the map and reduce sides of the shuffle. A number of common <code>Aggregator&lt;V&gt;</code>
implementations are provided in the
+<a href="apidocs/current/org/apache/crunch/fn/Aggregators.html">Aggregators</a>
class.</p>
 <p>Finally, PCollection, PTable, and PGroupedTable all support a <code>union</code>
operation, which takes a series of distinct PCollections that all have
 the same data type and treats them as a single virtual PCollection.</p>
-<p>All of the other MapReduce patterns supported by the Crunch APIs (aggregations,
joins, sorts, secondary sorts, and cogrouping) are all implemented
-in terms of these four primitives. The patterns themselves are defined in the <code>org.apache.crunch.lib</code>
package and its children, and a few of
-the most common patterns have convenience functions defined on the PCollection and PTable
interfaces. We will do a more detailed review of these
-patterns later in this document, but here are a few examples to get you started: TODO</p>
+<p>All of the other data transformation operations supported by the Crunch APIs (aggregations,
joins, sorts, secondary sorts, and cogrouping) are implemented
+in terms of these four primitives. The patterns themselves are defined in the <a href="apidocs/current/org/apache/crunch/lib/package-summary.html">org.apache.crunch.lib</a>
+package and its children, and a few of the most common patterns have convenience functions
defined on the PCollection and PTable interfaces.</p>
 <h3 id="writing-dofns">Writing DoFns</h3>
 <p>DoFns represent the logical computations of your Crunch pipelines. They are designed
to be easy to write, easy to test, and easy to deploy
 within the context of a MapReduce job. Much of your work with the Crunch APIs will be writing
DoFns, and so having a good understanding of
 how to use them effectively is critical to crafting elegant and efficient pipelines.</p>
 <h4 id="dofn-extends-serializable">DoFn extends Serializable</h4>
 <p>The most important thing to remember about DoFns is that they all implement the
<code>java.io.Serializable</code> interface, which means that all of the
-state information associated with a DoFn must also be serializable. There is an excellent
overview of Java serializability here that is worth
-reviewing if you aren't familiar with Java's serializability model. TODO</p>
-<p>If your DoFn needs to work with a class that does not implement Serializable and
cannot be modified (e.g., because it is defined in a third-party
+state information associated with a DoFn must also be serializable. There is an <a href="http://docs.oracle.com/javase/tutorial/jndi/objects/serial.html">excellent
overview of Java serializability</a> that is worth reviewing if you aren't familiar
with it already.</p>
+<p>If your DoFn needs to work with a class that does not implement Serializable and
cannot be modified (for example, because it is defined in a third-party
 library), you should use the <code>transient</code> keyword on that member variable
so that serializing the DoFn won't fail if that object happens to be
 defined. You can create an instance of the object during runtime using the <code>initialize</code>
method described in the following section.</p>
 <h4 id="runtime-processing-steps">Runtime Processing Steps</h4>
 <p>After the Crunch runtime loads the serialized DoFns into its map and reduce tasks,
the DoFns are executed on the input data via the following
 sequence:</p>
-<h1 id="first-the-dofn-is-given-access-to-the-taskinputoutputcontext-implementation-for-the-current-task-this-allows-the-dofn-to-access-any">First,
the DoFn is given access to the <code>TaskInputOutputContext</code> implementation
for the current task. This allows the DoFn to access any</h1>
-<p>necessary configuration and runtime information needed before or during processing.</p>
-<h1 id="next-the-dofns-initialize-method-is-called-the-initialize-method-is-similar-to-the-setup-method-used-in-the-mapper-and-reducer-classes">Next,
the DoFn's <code>initialize</code> method is called. The initialize method is
similar to the <code>setup</code> method used in the Mapper and Reducer classes;</h1>
-<p>it is called before processing begins in order to enable any necessary initialization
or configuration of the DoFn to be performed. For example,
-if we were making use of a non-serializable third-party library, we would create an instance
of it here.</p>
-<h1 id="at-this-point-data-processing-begins-the-map-or-reduce-task-will-begin-passing-records-in-to-the-dofns-process-method-and-capturing-the">At
this point, data processing begins. The map or reduce task will begin passing records in to
the DoFn's <code>process</code> method, and capturing the</h1>
-<p>output of the process method into an <code>Emitter&lt;T&gt;</code>
that can either pass the data along to another DoFn for processing or serialize it as the
output
-of the current processing stage.</p>
-<h1 id="finally-after-all-of-the-records-have-been-processed-the-void-cleanupemittert-emitter-method-is-called-on-each-dofn-the-cleanup-method">Finally,
after all of the records have been processed, the <code>void cleanup(Emitter&lt;T&gt;
emitter)</code> method is called on each DoFn. The cleanup method</h1>
-<p>has a dual purpose: it can be used to emit any state information that the DoFn wants
to pass along to the next stage (for example, cleanup could
+<ol>
+<li>First, the DoFn is given access to the <code>TaskInputOutputContext</code>
implementation for the current task. This allows the DoFn to access any
+necessary configuration and runtime information needed before or during processing.</li>
+<li>Next, the DoFn's <code>initialize</code> method is called. The initialize
method is similar to the <code>setup</code> method used in the Mapper and Reducer
classes;
+it is called before processing begins in order to enable any necessary initialization or
configuration of the DoFn to be performed. For example,
+if we were making use of a non-serializable third-party library, we would create an instance
of it here.</li>
+<li>At this point, data processing begins. The map or reduce task will begin passing
records in to the DoFn's <code>process</code> method, and capturing the
+output of the process method into an <code>Emitter&lt;T&gt;</code> that
can either pass the data along to another DoFn for processing or serialize it as the output
+of the current processing stage.</li>
+<li>Finally, after all of the records have been processed, the <code>void cleanup(Emitter&lt;T&gt;
emitter)</code> method is called on each DoFn. The cleanup method
+has a dual purpose: it can be used to emit any state information that the DoFn wants to pass
along to the next stage (for example, cleanup could
 be used to emit the sum of a list of numbers that was passed in to the DoFn's process method),
as well as to release any resources or perform any
-other cleanup task that is appropriate once the job has finished executing.</p>
+other cleanup task that is appropriate once the job has finished executing.</li>
+</ol>
 <h4 id="accessing-runtime-mapreduce-apis">Accessing Runtime MapReduce APIs</h4>
-<p>DoFns provide direct access to the <code>TaskInputOutputContext</code>
object that is used within a given Map or Reduce task via the protected <code>getContext</code>
+<p>DoFns provide direct access to the <code>TaskInputOutputContext</code>
object that is used within a given Map or Reduce task via the <code>getContext</code>
 method. There are also a number of helper methods for working with the objects associated
with the TaskInputOutputContext, including:</p>
 <ul>
 <li><code>getConfiguration()</code> for accessing the <code>Configuration</code>
object that contains much of the detail about system and user-specific parameters for a
@@ -337,57 +341,86 @@ framework won't kill it,</li>
 <li><code>setStatus(String status)</code> and <code>getStatus</code>
for setting task status information, and</li>
 <li><code>getTaskAttemptID()</code> for accessing the current <code>TaskAttemptID</code>
information.</li>
 </ul>
-<p>Crunch provides a number of helper methods, all named <code>increment</code>
and having various signatures, for working with Hadoop Counters.
-There was a change in the Counters API from Hadoop 1.0 to Hadoop 2.0, and thus we do not
recommend that you work with the <code>Counter</code> classes
-directly in your Crunch pipelines (the two <code>getCounter</code> methods that
were defined in DoFn are both deprecated) so that you will not be
-required to recompile your job jars when you move from a Hadoop 1.x cluster to a Hadoop 2.x
cluster.</p>
+<p>Crunch provides a number of helper methods for working with <a href="http://codingwiththomas.blogspot.com/2011/04/controlling-hadoop-job-recursion.html">Hadoop
Counters</a>, all named <code>increment</code>. Counters are an incredibly
useful way of keeping track of the state of long running data pipelines and detecting any
exceptional conditions that
+occur during processing, and they are supported in both the MapReduce-based and in-memory
Crunch pipeline contexts. You can retrieve the value of the Counters
+in your client code at the end of a MapReduce pipeline by getting them from the <a href="apidocs/current/org/apache/crunch/PipelineResult.StageResult.html">StageResult</a>
+objects returned by Crunch at the end of a run.</p>
+<p>(Note that there was a change in the Counters API from Hadoop 1.0 to Hadoop 2.0,
and thus we do not recommend that you work with the
+Counter classes directly in your Crunch pipelines (the two <code>getCounter</code>
methods that were defined in DoFn are both deprecated) so that you will not be
+required to recompile your job jars when you move from a Hadoop 1.0 cluster to a Hadoop 2.0
cluster.)</p>
 <h4 id="configuring-the-crunch-planner-and-mapreduce-jobs-with-dofns">Configuring the
Crunch Planner and MapReduce Jobs with DoFns</h4>
 <p>Although most of the DoFn methods are focused on runtime execution, there are a
handful of methods that are used during the planning phase
 before a pipeline is converted into MapReduce jobs. The first of these functions is <code>float
scaleFactor()</code>, which should return a floating point
 value greater than 0.0f. You can override the scaleFactor method in your custom DoFns in
order to provide a hint to the Crunch planner about
-how much larger (or smaller) an input data set will become after passing through the process
method. If the groupByKey method is called without
+how much larger (or smaller) an input data set will become after passing through the process
method. If the <code>groupByKey</code> method is called without
 an explicit number of reducers provided, the planner will try to guess how many reduce tasks
should be used for the job based on the size of
-the input data, which is determined in part by using the scaleFactor results.</p>
+the input data, which is determined in part by using the result of calling the <code>scaleFactor</code>
method on the DoFns in the processing path.</p>
 <p>Sometimes, you may know that one of your DoFns has some unusual parameter settings
that need to be specified on any job that includes that
 DoFn as part of its processing. A DoFn can modify the Hadoop Configuration object that is
associated with the MapReduce job it is assigned to
 on the client before processing begins by overriding the <code>void configure(Configuration
conf)</code> method. For example, you might know that the DoFn
 will require extra memory settings to run, and so you could make sure that the value of the
<code>mapred.child.java.opts</code> argument had a large enough
 memory setting for the DoFn's needs before the job was launched on the cluster.</p>
-<h4 id="dofn-extensions-and-helper-classes">DoFn Extensions and Helper Classes</h4>
+<h4 id="common-dofn-patterns">Common DoFn Patterns</h4>
 <p>The Crunch APIs contain a number of useful subclasses of DoFn that handle common
data processing scenarios and are easier
-to write and test. The top-level <code>org.apache.crunch</code> package contains
three of the most important specializations, which we will
-discuss now. Each of these specialized DoFn implementations has associated methods on the
PCollection, PTable, and PGroupedTable
-interfaces to support these common data processing tasks.</p>
-<p>The simplest extension is the <code>FilterFn&lt;T&gt;</code>
class, which defines a single abstract method, <code>boolean accept(T input)</code>.
The FilterFn can be applied
-to a <code>PCollection&lt;T&gt;</code> by calling the <code>filter(FilterFn&lt;T&gt;
fn)</code> method, and will return a new <code>PCollection&lt;T&gt;</code>
that only contains the elements
-of the input PCollection for which the accept method returned true. Note that the filter
function does not include a PType argument in its
+to write and test. The top-level <a href="apidocs/current/org/apache/crunch/package-summary.html">org.apache.crunch</a>
package contains three
+of the most important specializations, which we will discuss now. Each of these specialized
DoFn implementations has associated methods
+on the PCollection, PTable, and PGroupedTable interfaces to support common data processing
steps.</p>
+<p>The simplest extension is the <a href="apidocs/current/org/apache/crunch/FilterFn.html">FilterFn<T></a>
class, which defines a single abstract method, <code>boolean accept(T input)</code>.
+The FilterFn can be applied to a <code>PCollection&lt;T&gt;</code> by
calling the <code>filter(FilterFn&lt;T&gt; fn)</code> method, and will
return a new <code>PCollection&lt;T&gt;</code> that only contains
+the elements of the input PCollection for which the accept method returned true. Note that
the filter function does not include a PType argument in its
 signature, because there is no change in the data type of the PCollection when the FilterFn
is applied. It is possible to compose new FilterFn
-instances by combining multiple FilterFns together using the <code>and</code>,
<code>or</code>, and <code>not</code> factory methods defined in the
FilterFns helper class.</p>
-<p>The second extension is the <code>MapFn&lt;S, T&gt;</code> class,
which defines a single abstract method, <code>T map(S input)</code>. For simple
transform tasks in which
-every input record will have exactly one output, it's easy to test a MapFn by verifying that
a given input returns a given output. MapFns are
-also used by Crunch's data serialization libraries to map between serialized data types (such
as Writables or Avro records) and POJOs.</p>
+instances by combining multiple FilterFns together using the <code>and</code>,
<code>or</code>, and <code>not</code> factory methods defined in the
+<a href="apidocs/current/org/apache/crunch/fn/FilterFns.html">FilterFns</a> helper
class.</p>
+<p>The second extension is the <a href="apidocs/current/org/apache/crunch/MapFn.html">MapFn<S,
T></a> class, which defines a single abstract method, <code>T map(S input)</code>.
+For simple transform tasks in which every input record will have exactly one output, it's
easy to test a MapFn by verifying that
+a given input returns a given output.</p>
 <p>MapFns are also used in specialized methods on the PCollection and PTable interfaces.
<code>PCollection&lt;V&gt;</code> defines the method
 <code>PTable&lt;K,V&gt; by(MapFn&lt;V, K&gt; mapFn, PType&lt;K&gt;
keyType)</code> that can be used to create a PTable from a PCollection by writing a
 function that extracts the key (of type K) from the value (of type V) contained in the PCollection.
The by function only requires that the PType of
 the key be given and constructs a <code>PTableType&lt;K, V&gt;</code>
from the given key type and the PCollection's existing value type. <code>PTable&lt;K,
V&gt;</code>, in turn,
 has methods <code>PTable&lt;K1, V&gt; mapKeys(MapFn&lt;K, K1&gt; mapFn)</code>
and <code>PTable&lt;K, V2&gt; mapValues(MapFn&lt;V, V2&gt;)</code>
that handle the common case of converting
 just one of the paired values in a PTable instance from one type to another while leaving
the other type the same.</p>
-<p>The final top-level extension to DoFn is the <code>CombineFn&lt;K, V&gt;</code>
class, which is used in conjunction with the <code>combineValues</code> method
defined on the
-PGroupedTable interface. CombineFns are used to represent the associative operations that
can be applied using the MapReduce Combiner concept in
-order to reduce the amount of data that is shipped over the network during the shuffle. The
CombineFn extension is different from the FilterFn and
-MapFn classes in that it does not define an abstract method for handling data besides the
default <code>process</code> method that any other DoFn would use;
-rather, extending the CombineFn class signals to the Crunch planner that the logic contained
in this class satisfies the conditions required for use
-with the MapReduce combiner. Crunch supports many types of these associative patterns, such
as sums, counts, and set unions, via the <code>Aggregator&lt;V&gt;</code>
interface,
-which is defined right alongside the CombineFn class in the top-level <code>org.apache.crunch</code>
package. There are a number of implementations of the Aggregator
-interface defined via static factory methods in the <code>org.apache.crunch.fn.Aggregators</code>
class.</p>
+<p>The final top-level extension to DoFn is the <a href="apidocs/current/org/apache/crunch/CombineFn.html">CombineFn<K,
V></a> class, which is used in conjunction with
+the <code>combineValues</code> method defined on the PGroupedTable interface.
CombineFns are used to represent the associative operations that can be applied using
+the MapReduce Combiner concept in order to reduce the amount of data that is shipped over the
network during a shuffle.</p>
+<p>The CombineFn extension is different from the FilterFn and MapFn classes in that
it does not define an abstract method for handling data
+beyond the default <code>process</code> method that any other DoFn would use;
rather, extending the CombineFn class signals to the Crunch planner that the logic
+contained in this class satisfies the conditions required for use with the MapReduce combiner.</p>
+<p>Crunch supports many types of these associative patterns, such as sums, counts,
and set unions, via the <a href="apidocs/current/org/apache/crunch/Aggregator.html">Aggregator<V></a>
+interface, which is defined right alongside the CombineFn class in the top-level <code>org.apache.crunch</code>
package. There are a number of implementations of the Aggregator
+interface defined via static factory methods in the <a href="apidocs/current/org/apache/crunch/fn/Aggregators.html">Aggregators</a>
class.</p>
 <h3 id="serializing-data-with-ptypes">Serializing Data with PTypes</h3>
 <p>Why PTypes Are Necessary, the two type families, the core methods and tuples.</p>
 <h4 id="extending-ptypes">Extending PTypes</h4>
-<h3 id="reading-data-sources">Reading Data: Sources</h3>
-<h3 id="writing-data-targets">Writing Data: Targets</h3>
+<p>The simplest way to create a new <code>PType&lt;T&gt;</code>
for a data object is to create a <em>derived</em> PType from one of the built-in
PTypes for the Avro
+and Writable type families. If we have a base <code>PType&lt;S&gt;</code>,
we can create a derived <code>PType&lt;T&gt;</code> by implementing an
input <code>MapFn&lt;S, T&gt;</code> and an
+output <code>MapFn&lt;T, S&gt;</code> and then calling <code>PTypeFamily.derived(Class&lt;T&gt;,
MapFn&lt;S, T&gt; in, MapFn&lt;T, S&gt; out, PType&lt;S&gt; base)</code>,
which will return
+a new <code>PType&lt;T&gt;</code>. There are examples of derived PTypes
in the <a href="apidocs/current/org/apache/crunch/types/PTypes.html">PTypes</a>
class, including
+serialization support for protocol buffers, Thrift records, Java Enums, BigInteger, and UUIDs.</p>
+<h3 id="reading-and-writing-data-sources-targets-and-sourcetargets">Reading and Writing
Data: Sources, Targets, and SourceTargets</h3>
+<p>MapReduce developers are familiar with the <code>InputFormat&lt;K, V&gt;</code>
and <code>OutputFormat&lt;K, V&gt;</code> classes for reading and writing
data during
+MapReduce processing. Crunch has the analogous concepts of a <code>Source&lt;T&gt;</code>
for reading data and a <code>Target</code> for writing data. For data
+sources that may be treated as both the output of one pipeline phase and the input to another,
Crunch has a <code>SourceTarget&lt;T&gt;</code> interface
+that combines the functionality of both <code>Source&lt;T&gt;</code>
and <code>Target</code>.</p>
+<p>Sources and Targets provide several useful extensions to the functionality provided
by InputFormat and OutputFormat. First, a Source can
+encapsulate an InputFormat as well as any special Configuration settings that are needed
by that InputFormat. For example, the
+<code>AvroInputFormat</code> needs to know the Avro schema of the input Avro
file and expects to find that schema associated with the "avro.schema" key
+in the <code>Configuration</code> object for a pipeline. But if you need to read
multiple Avro files, each with its own schema, during a single MapReduce
+job, you need a way of ensuring that the different schemas for each file do not all overwrite
the "avro.schema" key in the shared
+<code>Configuration</code> object. Crunch's <code>Source&lt;T&gt;</code>
allows you to specify a set of key-value entries that need to be set in the <code>Configuration</code>
+before a particular input is read in a way that prevents them from conflicting with each
other, while the Target interface provides the same
+functionality for OutputFormats.</p>
+<p>The <code>Source&lt;T&gt;</code> interface has two useful extensions.
The first is <code>TableSource&lt;K, V&gt;</code> which extends <code>Source&lt;Pair&lt;K,
V&gt;&gt;</code> and can be
+used to read in a <code>PTable&lt;K, V&gt;</code> instance instead of
a <code>PCollection&lt;Pair&lt;K, V&gt;&gt;</code> instance. The
second extension is <code>ReadableSource&lt;T&gt;</code>, which
+declares an <code>Iterable&lt;T&gt; read(Configuration conf)</code> method
that allows the contents of the Source to be read directly, either into the client
+or into a DoFn implementation that can use the data read from the source to perform additional
transforms on the main input data that is
+processed using the DoFn's <code>process</code> method (this is how Crunch supports
mapside-join operations.)</p>
+<p>Support for the most common Source, Target, and SourceTarget implementations is
provided by the factory functions declared in the
+<a href="apidocs/current/org/apache/crunch/io/From.html">From</a> (Sources),
<a href="apidocs/current/org/apache/crunch/io/To.html">To</a> (Targets), and
+<a href="apidocs/current/org/apache/crunch/io/At.html">At</a> (SourceTargets)
classes in the <a href="apidocs/current/org/apache/crunch/io/package-summary.html">org.apache.crunch.io</a>
+package.</p>
 <h3 id="pipeline-building-and-execution">Pipeline Building and Execution</h3>
 <h4 id="creating-a-new-crunch-pipeline">Creating A New Crunch Pipeline</h4>
-<p>Section here on Configuration of pipelines.</p>
 <h4 id="managing-pipeline-execution-and-cleanup">Managing Pipeline Execution and Cleanup</h4>
 <h2 id="more-information">More Information</h2>
 <p><a href="pipelines.html">Writing Your Own Pipelines</a></p>
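
The "DoFn extends Serializable" section in the diff above advises marking a non-serializable member as transient and creating it in the initialize method at runtime. The following is a minimal sketch of that idea using only java.io, not the Crunch API itself. LegacyParser and ParsingFn are hypothetical stand-ins (for a third-party class and for a DoFn, respectively), and the roundTrip helper only imitates shipping a function from the client to a task.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientFieldSketch {

    /** Hypothetical stand-in for a third-party class that is not Serializable. */
    static class LegacyParser {
        int parse(String s) {
            return Integer.parseInt(s.trim());
        }
    }

    /** Hypothetical stand-in for a DoFn: serializable state plus one transient helper. */
    static class ParsingFn implements Serializable {
        private static final long serialVersionUID = 1L;

        private final int offset;               // ordinary state, serialized with the object
        private transient LegacyParser parser;  // skipped by Java serialization

        ParsingFn(int offset) {
            this.offset = offset;
        }

        /** Plays the role of the initialize step: build the helper at runtime. */
        void initialize() {
            parser = new LegacyParser();
        }

        boolean isInitialized() {
            return parser != null;
        }

        /** Plays the role of the process step for a single record. */
        int process(String input) {
            return parser.parse(input) + offset;
        }
    }

    /** Serializes and deserializes the function, imitating client-to-task shipping. */
    static ParsingFn roundTrip(ParsingFn fn) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(fn);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (ParsingFn) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ParsingFn clientSide = new ParsingFn(10);
        clientSide.initialize();  // the helper is set, yet serialization still succeeds

        ParsingFn taskSide = roundTrip(clientSide);
        System.out.println(taskSide.isInitialized());  // prints "false"

        taskSide.initialize();    // re-create the helper before processing begins
        System.out.println(taskSide.process(" 32 "));  // prints "42"
    }
}
```

Without the transient keyword, writeObject would throw a NotSerializableException once the parser field is set, which is the failure the section warns about.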


