crunch-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
Subject svn commit: r1545498 - /crunch/site/trunk/content/intro.mdtext
Date Tue, 26 Nov 2013 01:28:17 GMT
Author: jwills
Date: Tue Nov 26 01:28:17 2013
New Revision: 1545498

Add details on PType serialization


Modified: crunch/site/trunk/content/intro.mdtext
--- crunch/site/trunk/content/intro.mdtext (original)
+++ crunch/site/trunk/content/intro.mdtext Tue Nov 26 01:28:17 2013
@@ -135,8 +135,8 @@ we cannot know at runtime what type of d
 us with an object that contains this information: in our example word count application,
the object that tells us that we are working with strings is
 returned by the `Writables.strings()` static method that is the third argument to the `parallelDo`
function in `countWords`. Every `DoFn` instance must
 return a type that has an associated object, called a `PType<T>`, that contains instructions
for how to serialize the data returned by that `DoFn`. By default, Crunch
-supports two serialization frameworks, called _type families_: one based on Hadoop's `Writable`
interface, and another based on `Apache Avro`.
-You can read more about how to work with Crunch's serialization libraries here. TODO
+supports two serialization frameworks, called _type families_: one based on Hadoop's `Writable`
interface, and another based on `Apache Avro`. Details
+on the type families are contained in the section on "Serializing Data with PTypes" in this
 Because all of the core logic in our application is exposed via a single static method that
operates on Crunch interfaces, we can use Crunch's
 in-memory API to test our business logic using a unit testing framework like JUnit. Let's
look at an example unit test for the word count
@@ -307,7 +307,34 @@ interface defined via static factory met
 ### Serializing Data with PTypes
-Why PTypes Are Necessary, the two type families, the core methods and tuples.
+Every `PCollection<T>` has an associated `PType<T>` that encapsulates the information
on how to serialize and deserialize the contents of that
+PCollection. PTypes are necessary because of [type erasure](;
at runtime, when
+the Crunch planner is mapping from PCollections to a series of MapReduce jobs, the type of
a PCollection (that is, the `T` in `PCollection<T>`)
+is no longer available to us, and must be provided by the associated PType instance.
+Crunch supports two independent _type families_, which each implement the [PTypeFamily](apidocs/0.8.0/org/apache/crunch/types/PTypeFamily.html)
+one for Hadoop's [Writable interface](apidocs/0.8.0/org/apache/crunch/types/writable/WritableTypeFamily.html)
and another based on
+[Apache Avro](apidocs/0.8.0/org/apache/crunch/types/avro/AvroTypeFamily.html). There are
also classes that contain static factory methods for
+each PTypeFamily to allow for easy import and usage: one for [Writables](apidocs/0.8.0/org/apache/crunch/types/writable/Writables.html)
and one for
+The two different type families exist for historical reasons: Writables have long been the
standard form for representing serializable data in Hadoop,
+but the Avro based serialization scheme is very compact, fast, and allows for complex record
schemas to evolve over time. It's fine (and even encouraged)
+to mix-and-match PCollections that use different PTypes in the same Crunch pipeline (e.g.,
you could
+read in Writable data, do a shuffle using Avro, and then write the output data as Writables),
but each PCollection's PType must belong to a single
+type family; for example, you cannot have a PTable whose key is serialized as a Writable
and whose value is serialized as an Avro record.
+#### Core PTypes
+Both type families support a common set of primitive types (strings, longs, ints, floats,
doubles, booleans, and bytes) as well as more complex
+PTypes that can be constructed out of other PTypes:
+1. Tuples of other PTypes (`pairs`, `trips`, `quads`, and `tuples` for arbitrary N),
+2. Collections of other PTypes (`collections` to create a `Collection<T>` and `maps`
to return a `Map<String, T>`),
+3. and `tableOf` to construct a `PTableType<K, V>`, the PType used to distinguish a
`PTable<K, V>` from a `PCollection<Pair<K, V>>`.
+Both of the type families have additional methods for working with records that are specific
to each serialization format (for example, the
+AvroTypeFamily contains methods to support Generic and Specific records as well as Avro's
reflection-based serialization.)
 #### Extending PTypes

View raw message