beam-commits mailing list archives

From j...@apache.org
Subject [2/4] beam-site git commit: Rewrites the section on Coders to not talk about them as a parsing mechanism
Date Mon, 15 May 2017 19:16:32 GMT
Rewrites the section on Coders to not talk about them as a parsing mechanism


Project: http://git-wip-us.apache.org/repos/asf/beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam-site/commit/0d0da026
Tree: http://git-wip-us.apache.org/repos/asf/beam-site/tree/0d0da026
Diff: http://git-wip-us.apache.org/repos/asf/beam-site/diff/0d0da026

Branch: refs/heads/asf-site
Commit: 0d0da0265d8a3ee07493feec835e56efd6acfd85
Parents: 9cc5b22
Author: Eugene Kirpichov <kirpichov@google.com>
Authored: Fri May 12 16:06:09 2017 -0700
Committer: Eugene Kirpichov <kirpichov@google.com>
Committed: Mon May 15 11:28:52 2017 -0700

----------------------------------------------------------------------
 src/documentation/programming-guide.md | 38 ++++++-----------------------
 1 file changed, 8 insertions(+), 30 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/beam-site/blob/0d0da026/src/documentation/programming-guide.md
----------------------------------------------------------------------
diff --git a/src/documentation/programming-guide.md b/src/documentation/programming-guide.md
index 11ec86d..f70e255 100644
--- a/src/documentation/programming-guide.md
+++ b/src/documentation/programming-guide.md
@@ -1175,11 +1175,9 @@ See the  [Beam-provided I/O Transforms]({{site.baseurl }}/documentation/io/built
 
 ## <a name="coders"></a>Data encoding and type safety
 
-When you create or output pipeline data, you'll need to specify how the elements in your `PCollection`s are encoded and decoded to and from byte strings. Byte strings are used for intermediate storage as well reading from sources and writing to sinks. The Beam SDKs use objects called coders to describe how the elements of a given `PCollection` should be encoded and decoded.
+When Beam runners execute your pipeline, they often need to materialize the intermediate data in your `PCollection`s, which requires converting elements to and from byte strings. The Beam SDKs use objects called `Coder`s to describe how the elements of a given `PCollection` may be encoded and decoded.
 
-### Using coders
-
-You typically need to specify a coder when reading data into your pipeline from an external source (or creating pipeline data from local data), and also when you output pipeline data to an external sink.
+> Note that coders are unrelated to parsing or formatting data when interacting with external data sources or sinks. Such parsing or formatting should typically be done explicitly, using transforms such as `ParDo` or `MapElements`.
 
 {:.language-java}
 In the Beam SDK for Java, the type `Coder` provides the methods required for encoding and decoding data. The SDK for Java provides a number of Coder subclasses that work with a variety of standard Java types, such as Integer, Long, Double, StringUtf8 and more. You can find all of the available Coder subclasses in the [Coder package](https://github.com/apache/beam/tree/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders).
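The rewritten text above frames a coder purely as an encode/decode contract between an element type and byte strings. As a rough, hypothetical illustration of that contract (this is a toy class, not the actual Beam `Coder` API), a string coder might look like:

```python
class Utf8StringCoder:
    """Toy illustration of a coder: converts elements to and from byte strings."""

    def encode(self, value: str) -> bytes:
        # Materialize the element as a byte string, e.g. for intermediate storage.
        return value.encode("utf-8")

    def decode(self, data: bytes) -> str:
        # Reconstruct the element from its byte-string form.
        return data.decode("utf-8")


coder = Utf8StringCoder()
encoded = coder.encode("événement")
assert isinstance(encoded, bytes)
assert coder.decode(encoded) == "événement"
```

Note how nothing here parses or formats external data; the coder only round-trips an in-memory element through bytes, which is exactly why parsing belongs in a `ParDo` or `MapElements` instead.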
@@ -1187,38 +1185,18 @@ In the Beam SDK for Java, the type `Coder` provides the methods required for enc
 {:.language-py}
 In the Beam SDK for Python, the type `Coder` provides the methods required for encoding and decoding data. The SDK for Python provides a number of Coder subclasses that work with a variety of standard Python types, such as primitive types, Tuple, Iterable, StringUtf8 and more. You can find all of the available Coder subclasses in the [apache_beam.coders](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/coders) package.
 
-When you read data into a pipeline, the coder indicates how to interpret the input data into a language-specific type, such as integer or string. Likewise, the coder indicates how the language-specific types in your pipeline should be written into byte strings for an output data sink, or to materialize intermediate data in your pipeline.
-
-The Beam SDKs set a coder for every `PCollection` in a pipeline, including those generated as output from a transform. Most of the time, the Beam SDKs can automatically infer the correct coder for an output `PCollection`.
-
 > Note that coders do not necessarily have a 1:1 relationship with types. For example, the Integer type can have multiple valid coders, and input and output data can use different Integer coders. A transform might have Integer-typed input data that uses BigEndianIntegerCoder, and Integer-typed output data that uses VarIntCoder.
 
-You can explicitly set a `Coder` when inputting or outputting a `PCollection`. You set the `Coder` by <span class="language-java">calling the method `.withCoder`</span> <span class="language-py">setting the `coder` argument</span> when you apply your pipeline's read or write transform.
-
-Typically, you set the `Coder` when the coder for a `PCollection` cannot be automatically inferred, or when you want to use a different coder than your pipeline's default. The following example code reads a set of numbers from a text file, and sets a `Coder` of type <span class="language-java">`TextualIntegerCoder`</span> <span class="language-py">`VarIntCoder`</span> for the resulting `PCollection`:
-
-```java
-PCollection<Integer> numbers =
-  p.begin()
-  .apply(TextIO.Read.named("ReadNumbers")
-    .from("gs://my_bucket/path/to/numbers-*.txt")
-    .withCoder(TextualIntegerCoder.of()));
-```
-
-```py
-p = beam.Pipeline()
-numbers = ReadFromText("gs://my_bucket/path/to/numbers-*.txt", coder=VarIntCoder())
-```
+### Specifying coders
+The Beam SDKs require a coder for every `PCollection` in your pipeline. In most cases, the Beam SDK is able to automatically infer a `Coder` for a `PCollection` based on its element type or the transform that produces it; however, in some cases the pipeline author will need to specify a `Coder` explicitly, or develop a `Coder` for their custom type.
 
 {:.language-java}
-You can set the coder for an existing `PCollection` by using the method `PCollection.setCoder`. Note that you cannot call `setCoder` on a `PCollection` that has been finalized (e.g. by calling `.apply` on it).
+You can explicitly set the coder for an existing `PCollection` by using the method `PCollection.setCoder`. Note that you cannot call `setCoder` on a `PCollection` that has been finalized (e.g. by calling `.apply` on it).
 
 {:.language-java}
-You can get the coder for an existing `PCollection` by using the method `getCoder`. This method will fail with `anIllegalStateException` if a coder has not been set and cannot be inferred for the given `PCollection`.
-
-### Coder inference and default coders
+You can get the coder for an existing `PCollection` by using the method `getCoder`. This method will fail with an `IllegalStateException` if a coder has not been set and cannot be inferred for the given `PCollection`.
 
-The Beam SDKs require a coder for every `PCollection` in your pipeline. Most of the time, however, you do not need to explicitly specify a coder, such as for an intermediate `PCollection` produced by a transform in the middle of your pipeline. In such cases, the Beam SDKs can infer an appropriate coder from the inputs and outputs of the transform used to produce the PCollection.
+Beam SDKs use a variety of mechanisms when attempting to automatically infer the `Coder` for a `PCollection`.
 
 {:.language-java}
 Each pipeline object has a `CoderRegistry`. The `CoderRegistry` represents a mapping of Java types to the default coders that the pipeline should use for `PCollection`s of each type.
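The earlier note that coders are not 1:1 with types (BigEndianIntegerCoder vs VarIntCoder for Integer) can be made concrete with a sketch. The two classes below are hypothetical re-implementations, not Beam's actual coders: the same integer element is materialized by a fixed-width big-endian coder and by a variable-length coder, producing different byte strings that both decode back to the original value.

```python
class BigEndianIntCoder:
    """Fixed-width encoding: every int becomes exactly 4 big-endian bytes."""

    def encode(self, value: int) -> bytes:
        return value.to_bytes(4, byteorder="big")

    def decode(self, data: bytes) -> int:
        return int.from_bytes(data, byteorder="big")


class VarIntCoder:
    """Variable-length encoding: 7 value bits per byte, high bit marks continuation."""

    def encode(self, value: int) -> bytes:
        out = bytearray()
        while True:
            bits = value & 0x7F
            value >>= 7
            if value:
                out.append(bits | 0x80)  # more bytes follow
            else:
                out.append(bits)         # final byte
                return bytes(out)

    def decode(self, data: bytes) -> int:
        result = shift = 0
        for byte in data:
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                break
            shift += 7
        return result


# Both coders are valid for int, but they produce different byte strings.
fixed, varint = BigEndianIntCoder(), VarIntCoder()
assert fixed.encode(300) != varint.encode(300)
assert fixed.decode(fixed.encode(300)) == varint.decode(varint.encode(300)) == 300
```

This is why a `PCollection`'s coder must be recorded explicitly (or inferred) rather than derived from the element type alone: the type does not determine the byte-string form.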
@@ -1227,7 +1205,7 @@ Each pipeline object has a `CoderRegistry`. The `CoderRegistry` represents a map
 The Beam SDK for Python has a `CoderRegistry` that represents a mapping of Python types to the default coder that should be used for `PCollection`s of each type.
 
 {:.language-java}
-By default, the Beam SDK for Java automatically infers the `Coder` for the elements of an output `PCollection` using the type parameter from the transform's function object, such as `DoFn`. In the case of `ParDo`, for example, a `DoFn<Integer, String>function` object accepts an input element of type `Integer` and produces an output element of type `String`. In such a case, the SDK for Java will automatically infer the default `Coder` for the output `PCollection<String>` (in the default pipeline `CoderRegistry`, this is `StringUtf8Coder`).
+By default, the Beam SDK for Java automatically infers the `Coder` for the elements of a `PCollection` produced by a `PTransform` using the type parameter from the transform's function object, such as `DoFn`. In the case of `ParDo`, for example, a `DoFn<Integer, String>` function object accepts an input element of type `Integer` and produces an output element of type `String`. In such a case, the SDK for Java will automatically infer the default `Coder` for the output `PCollection<String>` (in the default pipeline `CoderRegistry`, this is `StringUtf8Coder`).
 
 {:.language-py}
 By default, the Beam SDK for Python automatically infers the `Coder` for the elements of an output `PCollection` using the typehints from the transform's function object, such as `DoFn`. In the case of `ParDo`, for example a `DoFn` with the typehints `@beam.typehints.with_input_types(int)` and `@beam.typehints.with_output_types(str)` accepts an input element of type int and produces an output element of type str. In such a case, the Beam SDK for Python will automatically infer the default `Coder` for the output `PCollection` (in the default pipeline `CoderRegistry`, this is `BytesCoder`).
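The inference the diff describes (via the type parameter in Java, via typehints in Python) ultimately amounts to a lookup from an element type to a default coder. A much-simplified sketch of that mechanism follows; the registry class and its methods are hypothetical stand-ins, and the real `CoderRegistry` is considerably richer.

```python
class ToyCoderRegistry:
    """Toy mapping from element types to default coder classes."""

    def __init__(self):
        self._coders = {}

    def register_coder(self, element_type, coder_cls):
        self._coders[element_type] = coder_cls

    def get_coder(self, element_type):
        try:
            return self._coders[element_type]()
        except KeyError:
            # Mirrors the SDK failing when no coder is set and none can be inferred.
            raise ValueError(f"no coder registered for {element_type!r}")


class Utf8Coder:
    def encode(self, value):
        return value.encode("utf-8")

    def decode(self, data):
        return data.decode("utf-8")


registry = ToyCoderRegistry()
registry.register_coder(str, Utf8Coder)

# An output typed as str gets its default coder by lookup on the element type.
coder = registry.get_coder(str)
assert coder.decode(coder.encode("hello")) == "hello"
```

Under this view, `getCoder` failing with `IllegalStateException` corresponds to the lookup finding no registered or inferable entry for the element type.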

