beam-commits mailing list archives

Subject [beam-site] 01/01: Prepare repository for deployment.
Date Wed, 19 Jul 2017 19:19:47 GMT
This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a commit to branch asf-site
in repository

commit e65a4057c96666f5431ef63e1bfc8dde92e51d82
Author: Mergebot <>
AuthorDate: Wed Jul 19 19:19:44 2017 +0000

    Prepare repository for deployment.
 content/documentation/io/io-toc/index.html  |   3 +-
 content/documentation/io/testing/index.html | 113 +++++++++++++++++++++++++++-
 2 files changed, 111 insertions(+), 5 deletions(-)

diff --git a/content/documentation/io/io-toc/index.html b/content/documentation/io/io-toc/index.html
index 1cd94ea..1c2002a 100644
--- a/content/documentation/io/io-toc/index.html
+++ b/content/documentation/io/io-toc/index.html
@@ -153,12 +153,13 @@
  <li><a href="/documentation/io/authoring-overview/">Authoring I/O Transforms - Overview</a></li>
+  <li><a href="/documentation/io/testing/">Testing I/O Transforms</a></li>
 <!-- TODO: commented out until this content is ready.
 * [Authoring I/O Transforms - Python](/documentation/io/authoring-python/)
 * [Authoring I/O Transforms - Java](/documentation/io/authoring-java/)
-* [Testing I/O Transforms](/documentation/io/testing/)
 * [Contributing I/O Transforms](/documentation/io/contributing/)
diff --git a/content/documentation/io/testing/index.html b/content/documentation/io/testing/index.html
index 86d132a..e8173ff 100644
--- a/content/documentation/io/testing/index.html
+++ b/content/documentation/io/testing/index.html
@@ -139,17 +139,122 @@
     <div class="body__contained">
       <p><a href="/documentation/io/io-toc/">Pipeline I/O Table of Contents</a></p>
-<h1 id="testing-io-transforms">Testing I/O Transforms</h1>
+<h2 id="testing-io-transforms-in-apache-beam">Testing I/O Transforms in Apache Beam</h2>
+<p><em>Examples and design patterns for testing Apache Beam I/O transforms</em></p>
+<nav class="language-switcher">
+  <strong>Adapt for:</strong>
+  <ul>
+    <li data-type="language-java" class="active">Java SDK</li>
+    <li data-type="language-py">Python SDK</li>
+  </ul>
   <p>Note: This guide is still in progress. There is an open issue to finish the guide:
<a href="">BEAM-1025</a>.</p>
-<h1 id="next-steps">Next steps</h1>
+<h2 id="introduction">Introduction</h2>
+<p>This document explains the set of tests that the Beam community recommends based
on our past experience writing I/O transforms. If you wish to contribute your I/O transform
to the Beam community, we’ll ask you to implement these tests.</p>
+<p>While it is standard to write unit tests and integration tests, there are many possible
definitions. Our definitions are:</p>
+  <li><strong>Unit Tests:</strong>
+    <ul>
+      <li>Goal: verifying correctness of the transform only - core behavior, corner
cases, etc.</li>
+      <li>Data store used: an in-memory version of the data store (if available), otherwise
you’ll need to write a <a href="#use-fakes">fake</a></li>
+      <li>Data set size: tiny (10s to 100s of rows)</li>
+    </ul>
+  </li>
+  <li><strong>Integration Tests:</strong>
+    <ul>
+      <li>Goal: catch problems that occur when interacting with real versions of the
runners/data store</li>
+      <li>Data store used: an actual instance, pre-configured before the test</li>
+      <li>Data set size: small to medium (1000 rows to 10s of GBs)</li>
+    </ul>
+  </li>
+<h2 id="a-note-on-performance-benchmarking">A note on performance benchmarking</h2>
+<p>We do not advocate writing a separate test specifically for performance benchmarking.
Instead, we recommend setting up integration tests that can accept the necessary parameters
to cover many different testing scenarios.</p>
+<p>For example, if integration tests are written according to the guidelines below,
the integration tests can be run on different runners (either local or in a cluster configuration)
and against a data store that is a small instance with a small data set, or a large production-ready
cluster with a larger data set. This can provide coverage for a variety of scenarios - one of
them is performance benchmarking.</p>
+<h2 id="test-balance-unit-vs-integration">Test Balance - Unit vs Integration</h2>
+<p>It’s easy to cover a large amount of code with an integration test, but it is
then hard to find the cause of a test failure, and such tests tend to be flakier.</p>
+<p>However, there is a valuable set of bugs found by tests that exercise multiple workers
reading/writing to data store instances that have multiple nodes (e.g., read replicas).
Those scenarios are hard to find with unit tests and we find they commonly cause bugs in
I/O transforms.</p>
+<p>Our test strategy is a balance of those 2 contradictory needs. We recommend doing
as much testing as possible in unit tests, and writing a single, small integration test that
can be run in various configurations.</p>
+<h2 id="examples">Examples</h2>
+  <li><a href="">BigtableIO</a>’s
testing implementation is considered the best example of current best practices for unit testing
<code class="highlighter-rouge">Source</code>s</li>
+  <li><a href="">JdbcIO</a>
has the current best practice examples for writing integration tests.</li>
+  <li><a href="">ElasticsearchIO</a>
demonstrates testing for bounded read/write</li>
+  <li><a href="">MqttIO</a>
and <a href="">AmqpIO</a>
demonstrate unbounded read/write</li>
+  <li><a href="">avroio_test</a>
for examples of testing liquid sharding, <code class="highlighter-rouge">source_test_utils</code>,
<code class="highlighter-rouge">assert_that</code> and <code class="highlighter-rouge">equal_to</code></li>
+<h2 id="unit-tests">Unit Tests</h2>
+<h3 id="goals">Goals</h3>
+  <li>Validate the correctness of the code in your I/O transform.</li>
+  <li>Validate that the I/O transform works correctly when used in concert with reference
implementations of the data store it connects with (where “reference implementation” means
a fake or in-memory version).</li>
+  <li>Be able to run quickly and need only one machine, with a reasonably small memory/disk
footprint and no non-local network access (preferably none at all). Aim for tests that run
within several seconds - anything above 20 seconds should be discussed with the beam dev mailing
+  <li>Validate that the I/O transform can handle network failures.</li>
+<h3 id="non-goals">Non-goals</h3>
+  <li>Test problems in the external data store - this can lead to extremely complicated
+<h3 id="implementing-unit-tests">Implementing unit tests</h3>
+<p>A general guide to writing Unit Tests for all transforms can be found in the <a
href="">PTransform Style
Guide</a>. We have expanded on a few important points below.</p>
+<p>If you are using the <code class="highlighter-rouge">Source</code> API,
make sure to exhaustively unit-test your code. A minor implementation error can lead to data
corruption or data loss (such as skipping or duplicating records) that can be hard for your
users to detect. Also look into using <span class="language-java"><code class="highlighter-rouge">SourceTestUtils</code></span><span
class="language-py"><code class="highlighter-rouge">source_test_utils</code></span> - it is a key p [...]
+<p>If you are not using the <code class="highlighter-rouge">Source</code>
API, you can use <code class="highlighter-rouge">TestPipeline</code> with <span
class="language-java"><code class="highlighter-rouge">PAssert</code></span><span
class="language-py"><code class="highlighter-rouge">assert_that</code></span>
to help with your testing.</p>
+<p>If you are implementing write, you can use <code class="highlighter-rouge">TestPipeline</code>
to write test data and then read and verify it using a non-Beam client.</p>
+<h3 id="use-fakes">Use fakes</h3>
+<p>Instead of using mocks in your unit tests (pre-programming exact responses to each
call for each test), use fakes. The preferred way to use fakes for I/O transform testing is
to use a pre-existing in-memory/embeddable version of the service you’re testing, but if
one does not exist consider implementing your own. Fakes have proven to be the right mix of
“you can get the conditions for testing you need” and “you don’t have to write a million
exacting mock function calls”.</p>
+<h3 id="network-failure">Network failure</h3>
+<p>To help with testing and separation of concerns, <strong>code that interacts
across a network should be handled in a separate class from your I/O transform</strong>.
The suggested design pattern is that your I/O transform throws exceptions once it determines
that a read or write is no longer possible.</p>
+<p>This allows the I/O transform’s unit tests to act as if they have a perfect network
connection, and they do not need to retry/otherwise handle network connection problems.</p>
+<h2 id="batching">Batching</h2>
+<p>If your I/O transform allows batching of reads/writes, you must force the batching
to occur in your test. Having configurable batch size options on your I/O transform allows
that to happen easily. These must be marked as test only.</p>
+# Next steps
-<p>If you have a well tested I/O transform, why not contribute it to Apache Beam? Read
all about it:</p>
+If you have a well tested I/O transform, why not contribute it to Apache Beam? Read all about
-<p><a href="/documentation/io/contributing/">Contributing I/O Transforms</a></p>
+[Contributing I/O Transforms](/documentation/io/contributing/)

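As an illustration of the "Use fakes" and "Batching" guidance added in this commit, here is a minimal Python sketch. It deliberately does not use the Beam API: FakeKeyValueStore, BatchingWriter, run_example and the batch_size parameter are hypothetical names invented for this example, not anything defined by Beam or by the commit. The fake keeps rows in memory and counts write calls, and the test-sized batch_size forces batching to occur with only a handful of records.

```python
class FakeKeyValueStore:
    """Hypothetical in-memory fake standing in for a real data store client."""

    def __init__(self):
        self.rows = {}
        self.write_calls = 0  # lets a test observe how many batches were sent

    def write_batch(self, batch):
        self.write_calls += 1
        for key, value in batch:
            self.rows[key] = value

    def read_all(self):
        return sorted(self.rows.items())


class BatchingWriter:
    """Hypothetical writer that buffers records and flushes them in batches."""

    def __init__(self, store, batch_size=1000):
        # batch_size is configurable so that a test can force small batches.
        self._store = store
        self._batch_size = batch_size
        self._buffer = []

    def write(self, record):
        self._buffer.append(record)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._buffer:
            self._store.write_batch(self._buffer)
            self._buffer = []


def run_example():
    """Write 7 records with a batch size of 3, then flush the remainder."""
    store = FakeKeyValueStore()
    writer = BatchingWriter(store, batch_size=3)
    for i in range(7):
        writer.write((i, f"row-{i}"))
    writer.flush()
    return store
```

With 7 records and a batch size of 3, the fake should see two full batches plus one final partial batch, so a unit test can assert on both the stored rows and the number of write calls without pre-programming mock responses.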
