beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-1799) IO ITs: simplify data loading design pattern
Date Wed, 12 Apr 2017 01:26:41 GMT

    [ https://issues.apache.org/jira/browse/BEAM-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965239#comment-15965239
] 

ASF GitHub Bot commented on BEAM-1799:
--------------------------------------

GitHub user ssisk opened a pull request:

    https://github.com/apache/beam/pull/2507

    [BEAM-1799] JdbcIOIT now uses writeThenRead style

    Removes JdbcTestDataSet's main, since it is no longer necessary for
    loading data.
    
    Be sure to do all of the following to help us incorporate your contribution
    quickly and easily:
    
     - [X] Make sure the PR title is formatted like:
       `[BEAM-<Jira issue #>] Description of pull request`
     - [X] Make sure tests pass via `mvn clean verify`. (Even better, enable
           Travis-CI on your fork and ensure the whole test matrix passes).
     - [X] Replace `<Jira issue #>` in the title with the actual Jira issue
           number, if there is one.
     - [X] If this contribution is large, please file an Apache
           [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
    
    ---
    I've now switched JdbcIOIT to do a write then a read. This code:
    * creates a table during test class setup (this runs on the client)
    * during the write test, generates data and populates data into that table
    * reads the data from the table and verifies it via HashingFn + count
    
    I really like the data generation - feels very clean, and I love the fact that using CountingSource
means this part can be parallelized for larger data sets. I suspect that we may choose to
refactor the "generate data" portion out into a helper method alongside HashingFn in io-common.
    
    R @dhalperi 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ssisk/beam jdbc-it-writeThenRead

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/2507.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2507
    
----
commit 26fb99021b1edd48f6660b95a106ecd9116c6f16
Author: Stephen Sisk <sisk@google.com>
Date:   2017-04-04T22:46:19Z

    JdbcIOIT now uses writeThenRead style
    
    Removes JdbcTestDataSet's main, since it is no longer necessary for
    loading data.

----


> IO ITs: simplify data loading design pattern
> --------------------------------------------
>
>                 Key: BEAM-1799
>                 URL: https://issues.apache.org/jira/browse/BEAM-1799
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>            Reporter: Stephen Sisk
>            Assignee: Stephen Sisk
>
> Problems with the current solution
> =============================
> * The IO IT data loading guidelines [1] are complicated & aren't "native junit" -
you end up working around junit rather than working with it (I was a part of defining them[0],
so I critique the rules with (heart) )
> * Doing data loading using external tools means we have additional dependencies outside
of the tests themselves. If we *must* use them, it's worth the time, but I think we have another
option. I find it especially amusing since the data loading tools are things like ycsb which
themselves are benchmarking tools ("I heard you like performance benchmarking, so here's a
performance benchmarking tool to use before you use your performance benchmarking tool"),
and really are just solving the problem of "I want to write data in parallel to this data
store" - that sounds familiar :) 
> The current guidelines also don't scale well to performance tests:
> * We want to write medium sized data for perf tests - doing data loading using external
tools means a minimum of 2 reads & writes. For the small scale ITs, that's not a big deal,
but for the large scale tests, if we assume we're working with a fixed budget, more data transferred/stored
~= fewer tests.
> * If you want to verify that large data sets are correct (or create them), you need to
actually read and write those large data sets - currently, the plan is that data loading/testing
infrastructure only runs on one machine, so those operations are going to be slow. We aren't
working with actual large data sets, so it won't take too long, but it's always nice to have
faster tests.
> New Proposed Solution
> ===================
> Instead of trying to test read and write separately, the test should be a "write, then
read back what you just wrote", all using the IO under test. To support scenarios like "I
want to run my read test repeatedly without re-writing the data", tests would add flags for
"skipCleanUp" and "useExistingData".
> Check out the example I wrote up [2]
> I didn't want to invest much time on this before I opened a Jira/talked to others, so
I plan on expanding on this a bit more/formalizing it in the testing docs.
> A reminder of some context:
> * The goals for the ITs & Perf tests are that they are *not* intended to be the place
where we exercise specific scenarios. Instead, they are tripwires designed to find problems
with code *we already believe works* (as proven by the unit tests) when it runs against real
data store instances/runners using multiple nodes of both.
> There are some definite disadvantages: 
> * There is a class of bugs that you can miss doing this. (namely: "I mangled the data
on the way into the data store, and then reverse-mangled it again on the way back out so it
looks fine, even though it is bad in the db") I assume that many of us have tested storage
code in the past, and so we've thought about this trade-off. In this particular environment,
where it's expensive/tricky to do independent testing of the storage code, I think this is
the right trade off.
> * The data loading scripts cannot be re-used between languages. I think this will be
a pretty small relative cost compared to the cost of writing the IO in multiple languages,
so it shouldn't matter too much. I think we'll save more time in not needing to use external
tools for loading data.
> * Read-only or write-only data stores - in this case, we'll either need to default to
the old plan, or implement data loading or verification using beam
> * This assumes the data store support parallelism - in the case where the read or write
cannot be split, we probably should limit the amount of data we process in the tests to what
we can reasonably do on a single worker anyway.
> * It's harder to debug when this fails - I agree, and part of what I hope to invest a
little time in as I go forward is to make it easier to determine what the actual failure is.
Presumably folks debugging a particular IO's failures have tools to look at that IO and will
be able to quickly determine if it's failing on the read or write.
> * As with the previously before accepted proposal, we are relying on junit's afterClass
to do cleanups. I don't have a good answer for this - if it proves to be a problem, we can
investigate. 
> * This focuses the test exclusively on reading and writing. To address this, if we wanted
to write other types of tests, they could either piggy back off the writeThenRead test, or
it might be that they should be restricted to smaller data sets and they should be tested
independently from this test and simply write their own data to the data store.
> There are some really nice advantages:
> * The test ends up being pretty simple and elegant.
> * We have no external dependencies
> * Read and write occurs the bare minimum number of times
> * I believe we'll be able to create shared PTransforms for generating test data &
validating test data.
> [0] Past discussion of IT guidelines - https://lists.apache.org/thread.html/a8ea2507aee4a849cbb6cd7f3ae23fc8b47d447bd553fa01d6da6348@%3Cdev.beam.apache.org%3E
> [1] Current data loading for IT guidelines - https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.uj505twpx0m
> [2] Example of writeThenRead test - https://github.com/ssisk/beam/blob/jdbc-it-perf/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOIT.java#L147



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message