crunch-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
Subject svn commit: r888090 - in /websites/staging/crunch/trunk/content: ./ intro.html
Date Tue, 26 Nov 2013 01:28:28 GMT
Author: buildbot
Date: Tue Nov 26 01:28:28 2013
New Revision: 888090

Staging update by buildbot for crunch

    websites/staging/crunch/trunk/content/   (props changed)

Propchange: websites/staging/crunch/trunk/content/
--- cms:source-revision (original)
+++ cms:source-revision Tue Nov 26 01:28:28 2013
@@ -1 +1 @@

Modified: websites/staging/crunch/trunk/content/intro.html
--- websites/staging/crunch/trunk/content/intro.html (original)
+++ websites/staging/crunch/trunk/content/intro.html Tue Nov 26 01:28:28 2013
@@ -247,8 +247,8 @@ we cannot know at runtime what type of d
 us with an object that contains this information: in our example word count application,
the object that tells us that we are working with strings is
 returned by the <code>Writables.strings()</code> static method that is the third
argument to the <code>parallelDo</code> function in <code>countWords</code>.
Every <code>DoFn</code> instance must
 return a type that has an associated object, called a <code>PType&lt;T&gt;</code>,
that contains instructions for how to serialize the data returned by that <code>DoFn</code>.
By default, Crunch
-supports two serialization frameworks, called <em>type families</em>: one based
on Hadoop's <code>Writable</code> interface, and another based on <code>Apache
-You can read more about how to work with Crunch's serialization libraries here. TODO</p>
+supports two serialization frameworks, called <em>type families</em>: one based
on Hadoop's <code>Writable</code> interface, and another based on <code>Apache
Avro</code>. Details
+on the type families are contained in the section on "Serializing Data with PTypes" in this
 <p>Because all of the core logic in our application is exposed via a single static
method that operates on Crunch interfaces, we can use Crunch's
 in-memory API to test our business logic using a unit testing framework like JUnit. Let's
look at an example unit test for the word count
@@ -390,7 +390,30 @@ contained in this class satisfies the co
 interface, which is defined right alongside the CombineFn class in the top-level <code>org.apache.crunch</code>
package. There are a number of implementations of the Aggregator
 interface defined via static factory methods in the <a href="apidocs/0.8.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a>
 <h3 id="serializing-data-with-ptypes">Serializing Data with PTypes</h3>
-<p>Why PTypes Are Necessary, the two type families, the core methods and tuples.</p>
+<p>Every <code>PCollection&lt;T&gt;</code> has an associated <code>PType&lt;T&gt;</code>
that encapsulates the information on how to serialize and deserialize the contents of that
+PCollection. PTypes are necessary because of <a href="">type
erasure</a>; at runtime, when
+the Crunch planner is mapping from PCollections to a series of MapReduce jobs, the type of
a PCollection (that is, the <code>T</code> in <code>PCollection&lt;T&gt;</code>)
+is no longer available to us, and must be provided by the associated PType instance.</p>
+<p>Crunch supports two independent <em>type families</em>, which each implement
the <a href="apidocs/0.8.0/org/apache/crunch/types/PTypeFamily.html">PTypeFamily</a>
+one for Hadoop's <a href="apidocs/0.8.0/org/apache/crunch/types/writable/WritableTypeFamily.html">Writable
interface</a> and another based on
+<a href="apidocs/0.8.0/org/apache/crunch/types/avro/AvroTypeFamily.html">Apache Avro</a>.
There are also classes that contain static factory methods for
+each PTypeFamily to allow for easy import and usage: one for <a href="apidocs/0.8.0/org/apache/crunch/types/writable/Writables.html">Writables</a>
and one for
+<a href="apidocs/0.8.0/org/apache/crunch/types/avro/Avros.html">Avros</a>.</p>
+<p>The two different type families exist for historical reasons: Writables have long
been the standard form for representing serializable data in Hadoop,
+but the Avro based serialization scheme is very compact, fast, and allows for complex record
schemas to evolve over time. It's fine (and even encouraged)
+to mix-and-match PCollections that use different PTypes in the same Crunch pipeline (e.g.,
you could
+read in Writable data, do a shuffle using Avro, and then write the output data as Writables),
but each PCollection's PType must belong to a single
+type family; for example, you cannot have a PTable whose key is serialized as a Writable
and whose value is serialized as an Avro record.</p>
+<h4 id="core-ptypes">Core PTypes</h4>
+<p>Both type families support a common set of primitive types (strings, longs, ints,
floats, doubles, booleans, and bytes) as well as more complex
+PTypes that can be constructed out of other PTypes:</p>
+<li>Tuples of other PTypes (<code>pairs</code>, <code>trips</code>,
<code>quads</code>, and <code>tuples</code> for arbitrary N),</li>
+<li>Collections of other PTypes (<code>collections</code> to create a <code>Collection&lt;T&gt;</code>
and <code>maps</code> to return a <code>Map&lt;String, T&gt;</code>),</li>
+<li>and <code>tableOf</code> to construct a <code>PTableType&lt;K,
V&gt;</code>, the PType used to distinguish a <code>PTable&lt;K, V&gt;</code>
from a <code>PCollection&lt;Pair&lt;K, V&gt;&gt;</code>.</li>
+<p>Both of the type families have additional methods for working with records that
are specific to each serialization format (for example, the
+AvroTypeFamily contains methods to support Generic and Specific records as well as Avro's
reflection-based serialization.)</p>
 <h4 id="extending-ptypes">Extending PTypes</h4>
 <p>The simplest way to create a new <code>PType&lt;T&gt;</code>
for a data object is to create a <em>derived</em> PType from one of the built-in
PTypes for the Avro
 and Writable type families. If we have a base <code>PType&lt;S&gt;</code>,
we can create a derived <code>PType&lt;T&gt;</code> by implementing an
input <code>MapFn&lt;S, T&gt;</code> and an

View raw message