beam-dev mailing list archives

From Stephen Sisk <s...@google.com.INVALID>
Subject Re: Hosting data stores for IO Transform testing
Date Fri, 23 Dec 2016 23:07:44 GMT
hey Etienne -

thanks for your thoughts and thanks for sharing your experiences. I
generally agree with what you're saying. Quick comments below:

> ITs are stored alongside UTs in the src/test directory of the IO, but they
might move to a dedicated module, pending consensus
I don't have a strong opinion or feel that I've worked enough with maven to
understand all the consequences - I'd love for someone with more maven
experience to weigh in. If this becomes blocking, I'd say check it in, and
we can refactor later if it proves problematic.
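
For what it's worth, one lightweight shape this could take (just a sketch,
not a decision - the class and category names below are invented) is to
keep the ITs next to the UTs but mark them so the build can include or
exclude them:

  import org.junit.Test;
  import org.junit.experimental.categories.Category;

  // Hypothetical marker for tests that need a real data store instance.
  interface NeedsRealDataStore {}

  // Hypothetical IT class; the *IT suffix is the usual maven-failsafe-plugin
  // naming convention, so these can be run separately from the unit tests
  // (e.g. only under a profile where a real instance is available).
  @Category(NeedsRealDataStore.class)
  public class SomeIOIT {
    @Test
    public void readFromRealInstance() {
      // would connect to a real data store instance rather than an embedded one
    }
  }

A Maven profile (or just failsafe's naming convention) could then pick these
up only where a real instance is available, which might sidestep the
dedicated-module question for now.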

> Also, IMHO, it is better for tests to load/clean data than to make
assumptions about the running order of the tests.
I definitely agree that we don't want to make assumptions about the running
order of the tests - that way lies pain. :) It will be interesting to see
how the performance tests work out, since they will need more data (and thus
loading data can take much longer). This should also be an easier problem
for read tests than for write tests - if we have long-running instances,
read tests don't really need cleanup. And if write tests only write a small
amount of data, then as long as we are sure we're writing to uniquely
identifiable locations (i.e., a new table per test or something similar), we
can clean up the write test data on a slower schedule.
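
To make the "uniquely identifiable locations" idea concrete, here's a rough
sketch (all names invented) of what a write test could do:

  import java.util.UUID;
  import org.junit.After;
  import org.junit.Test;

  public class WriteIT {
    // One uniquely named location per test run (table, index, ...), so runs
    // on a shared, long-running instance never collide with each other.
    private final String targetTable =
        "beam_write_it_" + UUID.randomUUID().toString().replace("-", "");

    @Test
    public void writeSmallDataset() {
      // run the write pipeline against targetTable
    }

    @After
    public void cleanUp() {
      // drop targetTable; because the name is unique and prefixed, a slower
      // scheduled sweep could also remove anything a crashed run left behind
    }
  }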

> this will tend to go in the direction of long-running data store
instances rather than data store instances started (and optionally loaded)
before the tests.
It may be easiest to start with a "data stores stay running"
implementation, and then, if we see issues with that, move towards tests
that start/stop the data stores on each run. One thing I'd like to make
sure of is that we're not manually tweaking the configurations of the data
stores. One way we could do that is to destroy/recreate the data stores on
a slower schedule - maybe once per week. That way, if the script is changed
or the data store instances are changed, we'd be able to detect it
relatively soon while still removing the need for the tests to manage the
data stores.

as a general note, I suspect many of the folks in the states will be on
holiday until Jan 2nd/3rd.

S

On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot <echauchot@gmail.com>
wrote:

> Hi,
>
> Recently we had a discussion about integration tests of IOs. I'm
> preparing a PR for integration tests of the Elasticsearch IO
> (
> https://github.com/echauchot/incubator-beam/tree/BEAM-1184-ELASTICSEARCH-IO
> as a first shot). They are very important IMHO because they helped catch
> some bugs that UTs could not (volume, data store instance sharing, real
> data store instance ...)
>
> I would like to have your thoughts/remarks about the points below. Some of
> these points are also discussed here
>
> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a
> :
>
> - UTs and ITs have a similar architecture, but while UTs focus on testing
> the correct behavior of the code (including corner cases) and use an
> embedded in-memory data store, ITs assume that the behavior is correct
> (strong UTs) and focus on higher-volume testing and testing against real
> data store instance(s)
>
> - For now, ITs are stored alongside UTs in the src/test directory of the
> IO, but they might move to a dedicated module, pending consensus. Maven
> is not configured to run them automatically because the data store is not
> available on the Jenkins server yet
>
> - For now, they only use the DirectRunner, but they will be run against
> each runner.
>
> - ITs do not set up the data store instance (as stated in the above
> document); they assume that one is already running (configuration is
> hardcoded in the test for now, waiting for a common solution to pass
> configuration to ITs; see the sketch after this list). A Docker container
> script is provided in the contrib directory as a starting point for
> whatever orchestration software will be chosen.
>
> - ITs load and clean test data before and after each test if needed. It
> is simpler to do so because some tests need an empty data store (write
> tests) and because, as discussed in the document, tests might not be the
> only users of the data store. Also, IMHO, it is better for tests to
> load/clean data than to make assumptions about the running order of
> the tests.
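>
> To make the last two points more concrete, here is a rough sketch (class,
> property and method names are invented, just to show the shape) of how an
> IT could read its configuration from outside and load/clean its own data:
>
>   import org.junit.After;
>   import org.junit.Before;
>   import org.junit.Test;
>
>   public class ElasticsearchIOIT {
>     // Connection settings come from outside (system properties here)
>     // instead of being hardcoded, so the same test can run locally or on
>     // Jenkins against whatever instance is available.
>     private final String host = System.getProperty("esHost", "localhost");
>     private final int port =
>         Integer.parseInt(System.getProperty("esPort", "9200"));
>
>     @Before
>     public void loadTestData() {
>       // load the dataset this test needs into the running instance
>     }
>
>     @Test
>     public void readTest() {
>       // run the read pipeline against host:port
>     }
>
>     @After
>     public void cleanTestData() {
>       // remove the loaded data, leaving the instance clean for other users
>     }
>   }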
>
> If we generalize this pattern to all ITs, this will tend to go in the
> direction of long-running data store instances rather than data store
> instances started (and optionally loaded) before the tests.
>
> Besides, if we were to change our minds and load data from outside the
> tests, a Logstash script is provided.
>
> If you have any thoughts or remarks I'm all ears :)
>
> Regards,
>
> Etienne
>
> > On 14/12/2016 at 17:07, Jean-Baptiste Onofré wrote:
> > Hi Stephen,
> >
> > the purpose of having them in a specific module is to share resources,
> > apply the same behavior from an IT perspective, and be able to have ITs
> > that "cross" IOs (for instance, reading from JMS and sending to Kafka; I
> > think that's the key idea for integration tests).
> >
> > For instance, in Karaf, we have:
> > - utest in each module
> > - an itest module containing itests for all modules together
> >
> > Regards
> > JB
> >
> > On 12/14/2016 04:59 PM, Stephen Sisk wrote:
> >> Hi Etienne,
> >>
> >> thanks for following up and answering my questions.
> >>
> >> re: where to store integration tests - having them all in a separate
> >> module is an interesting idea. I couldn't find JB's comments about
> >> moving them into a separate module in the PR - can you share the
> >> reasons for doing so?
> >> The IO integration/perf tests do seem like they'll need to be treated
> >> in a special manner, but given that there is already an IO-specific
> >> module, it may just be that we need to treat all the ITs in the IO
> >> module the same way. I don't have strong opinions either way right now.
> >>
> >> S
> >>
> >> On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot <echauchot@gmail.com>
> >> wrote:
> >>
> >> Hi guys,
> >>
> >> @Stephen: I addressed all your comments directly in the PR, thanks!
> >> I just wanted to comment here about the Docker image I used: the only
> >> official Elastic image contains only Elasticsearch. But for testing I
> >> needed Logstash (for ingestion) and Kibana (not for integration tests,
> >> but to easily test REST requests to ES using Sense). This is why I use
> >> an ELK (Elasticsearch+Logstash+Kibana) image. This one is released under
> >> the Apache 2 license.
> >>
> >>
> >> Besides, there is also a point about where to store integration tests:
> >> JB proposed in the PR to store integration tests in a dedicated module
> >> rather than directly in the IO module (as I did).
> >>
> >>
> >>
> >> Etienne
> >>
> >> On 01/12/2016 at 20:14, Stephen Sisk wrote:
> >>> hey!
> >>>
> >>> thanks for sending this. I'm very excited to see this change. I
> >>> added some
> >>> detail-oriented code review comments in addition to what I've discussed
> >>> here.
> >>>
> >>> The general goal is to allow for re-usable instantiation of particular
> >> data
> >>> store instances and this seems like a good start. Looks like you
> >>> also have
> >>> a script to generate test data for your tests - that's great.
> >>>
> >>> The next steps (definitely not blocking your work) will be to have
> >>> ways to create instances from the Docker images you have here, and use
> >>> them in the tests. We'll need support in the test framework for that,
> >>> since it'll be different on developer machines and in the Beam Jenkins
> >>> cluster, but your scripts here let someone running these tests locally
> >>> avoid worrying about getting the instance set up (and they can manually
> >>> adjust it), so this is a good incremental step.
> >>>
> >>> I have some thoughts now that I'm reviewing your scripts (that I didn't
> >>> have previously, so we are learning this together):
> >>> * It may be useful to try and document why we chose a particular Docker
> >>> image as the base (i.e., "this is the officially supported Elasticsearch
> >>> Docker image" or "this image has several data stores together that can
> >>> be used for a couple of different tests") - I'm curious as to whether
> >>> the community thinks that is important
> >>>
> >>> One thing that I called out in the comment that's worth mentioning
> >>> on the
> >>> larger list - if you want to specify which specific runners a test
> >>> uses,
> >>> that can be controlled in the pom for the module. I updated the testing
> >> doc
> >>> mentioned previously in this thread with a TODO to talk about this
> >>> more. I
> >>> think we should also make it so that IO modules have that
> >>> automatically,
> >> so
> >>> developers don't have to worry about it.
> >>>
> >>> S
> >>>
> >>> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot <echauchot@gmail.com>
> >> wrote:
> >>>
> >>> Stephen,
> >>>
> >>> As discussed, I added the injection script, Docker container scripts, and
> >>> integration tests to the sdks/java/io/elasticsearch/contrib
> >>> <
> >>>
> >>
> https://github.com/apache/incubator-beam/pull/1439/files/1e7e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7d824cefcb3ed0b9
> >>
> >>> directory in that PR:
> >>> https://github.com/apache/incubator-beam/pull/1439.
> >>>
> >>> These work well but they are a first shot. Do you have any comments about
> >>> them?
> >>>
> >>> Besides, I am not very sure that these files should be in the IO itself
> >>> (even in the contrib directory, outside the Maven source directories). Any
> >>> thoughts?
> >>>
> >>> Thanks,
> >>>
> >>> Etienne
> >>>
> >>>
> >>>
> >>> On 23/11/2016 at 19:03, Stephen Sisk wrote:
> >>>> It's great to hear more experiences.
> >>>>
> >>>> I'm also glad to hear that people see real value in the high
> >>>> volume/performance benchmark tests. I tried to capture that in the
> >> Testing
> >>>> doc I shared, under "Reasons for Beam Test Strategy". [1]
> >>>>
> >>>> It does generally sound like we're in agreement here. Areas of
> >>>> discussion I see:
> >>>> 1. People like the idea of bringing up fresh instances for each test
> >>>> rather than keeping instances running all the time, since that ensures
> >>>> no contamination between tests. That seems reasonable to me. If we see
> >>>> flakiness in the tests or we note that setting up/tearing down
> >>>> instances is taking a lot of time, we can revisit this.
> >>>> 2. Deciding on cluster management software/orchestration software - I
> >> want
> >>>> to make sure we land on the right tool here since choosing the
> >>>> wrong tool
> >>>> could result in administration of the instances taking more work. I
> >>> suspect
> >>>> that's a good place for a follow up discussion, so I'll start a
> >>>> separate
> >>>> thread on that. I'm happy with whatever tool we choose, but I want to
> >> make
> >>>> sure we take a moment to consider different options and have a
> >>>> reason for
> >>>> choosing one.
> >>>>
> >>>> Etienne - thanks for being willing to port your creation/other scripts
> >>>> over. You might be a good early tester of whether this system works
> >>>> well
> >>>> for everyone.
> >>>>
> >>>> Stephen
> >>>>
> >>>> [1]  Reasons for Beam Test Strategy -
> >>>>
> >>>
> >>
> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#
> >>
> >>>>
> >>>>
> >>>> On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré
> >>>> <jb@nanthrax.net>
> >>>> wrote:
> >>>>
> >>>>> I second Etienne there.
> >>>>>
> >>>>> We worked together on the ElasticsearchIO and definitely, the most
> >>>>> valuable tests we did were integration tests with ES on Docker and
> >>>>> high volume.
> >>>>>
> >>>>> I think we have to distinguish the two kinds of tests:
> >>>>> 1. utests are located in the IO itself and basically they should
> >>>>> cover
> >>>>> the core behaviors of the IO
> >>>>> 2. itests are located as contrib in the IO (they could be part of
> >>>>> the IO but executed by the integration-test plugin or a specific
> >>>>> profile) and deal with a "real" backend and high volumes. The
> >>>>> resources required by the itests can be bootstrapped by Jenkins (for
> >>>>> instance using Mesos/Marathon and Docker images, as already
> >>>>> discussed; it's what I'm doing on my own "server").
> >>>>>
> >>>>> It's basically what Stephen described.
> >>>>>
> >>>>> We must not rely only on itests: utests are very important and
> >>>>> they validate the core behavior.
> >>>>>
> >>>>> My $0.01 ;)
> >>>>>
> >>>>> Regards
> >>>>> JB
> >>>>>
> >>>>> On 11/23/2016 09:27 AM, Etienne Chauchot wrote:
> >>>>>> Hi Stephen,
> >>>>>>
> >>>>>> I like your proposal very much and I also agree that Docker + some
> >>>>>> orchestration software would be great!
> >>>>>>
> >>>>>> On the ElasticsearchIO (PR to be created this week) there are Docker
> >>>>>> container creation scripts and a Logstash data ingestion script for
> >>>>>> the IT environment, available in the contrib directory alongside the
> >>>>>> integration tests themselves. I'll be happy to make them compliant
> >>>>>> with the new IT environment.
> >>>>>>
> >>>>>> What you say below about the need for an external IT environment is
> >>>>>> particularly true. As an example with ES, what came out in the first
> >>>>>> implementation was that there were problems starting at some high
> >>>>>> volume of data (timeouts, ES windowing overflow...) that could not
> >>>>>> have been seen on the embedded ES version. Also, there were some
> >>>>>> particularities of an external instance, like secondary (replica)
> >>>>>> shards, that were not visible on an embedded instance.
> >>>>>>
> >>>>>> Besides, I also favor bringing up instances before the tests because
> >>>>>> it ensures (amongst other things) that we start on a fresh dataset,
> >>>>>> so the tests are deterministic.
> >>>>>>
> >>>>>> Etienne
> >>>>>>
> >>>>>>
> >>>>>> On 23/11/2016 at 02:00, Stephen Sisk wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I'm excited we're getting lots of discussion going. There are many
> >>>>>>> threads
> >>>>>>> of conversation here, we may choose to split some of them off
> >>>>>>> into a
> >>>>>>> different email thread. I'm also betting I missed some of the
> >>>>>>> questions in
> >>>>>>> this thread, so apologies ahead of time for that. Also apologies
> >>>>>>> for
> >>> the
> >>>>>>> amount of text, I provided some quick summaries at the top of each
> >>>>>>> section.
> >>>>>>>
> >>>>>>> Amit - thanks for your thoughts. I've responded in detail below.
> >>>>>>> Ismael - thanks for offering to help. There's plenty of work
> >>>>>>> here to
> >> go
> >>>>>>> around. I'll try and think about how we can divide up some next
> >>>>>>> steps
> >>>>>>> (probably in a separate thread.) The main next step I see is
> >>>>>>> deciding
> >>>>>>> between kubernetes/mesos+marathon/docker swarm - I'm working on
> >>>>>>> that,
> >>>>> but
> >>>>>>> having lots of different thoughts on what the
> >>>>>>> advantages/disadvantages
> >>>>> of
> >>>>>>> those are would be helpful (I'm not entirely sure of the
> >>>>>>> protocol for
> >>>>>>> collaborating on sub-projects like this.)
> >>>>>>>
> >>>>>>> These issues are all related to what kind of tests we want to
> >>>>>>> write. I
> >>>>>>> think a kubernetes/mesos/swarm cluster could support all the use
> >>>>>>> cases
> >>>>>>> we've discussed here (and thus should not block moving forward with
> >>>>>>> this),
> >>>>>>> but understanding what we want to test will help us understand
> >>>>>>> how the
> >>>>>>> cluster will be used. I'm working on a proposed user guide for
> >>>>>>> testing
> >>>>> IO
> >>>>>>> Transforms, and I'm going to send out a link to that + a short
> >>>>>>> summary
> >>>>> to
> >>>>>>> the list shortly so folks can get a better sense of where I'm
> >>>>>>> coming
> >>>>>>> from.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Here's my thinking on the questions we've raised here -
> >>>>>>>
> >>>>>>> Embedded versions of data stores for testing
> >>>>>>> --------------------
> >>>>>>> Summary: yes! But we still need real data stores to test against.
> >>>>>>>
> >>>>>>> I am a gigantic fan of using embedded versions of the various data
> >>>>>>> stores.
> >>>>>>> I think we should test everything we possibly can using them,
> >>>>>>> and do
> >>> the
> >>>>>>> majority of our correctness testing using embedded versions + the
> >>> direct
> >>>>>>> runner. However, it's also important to have at least one test that
> >>>>>>> actually connects to an actual instance, so we can get coverage for
> >>>>>>> things
> >>>>>>> like credentials, real connection strings, etc...
> >>>>>>>
> >>>>>>> The key point is that embedded versions definitely can't cover the
> >>>>>>> performance tests, so we need to host instances if we want to test
> >>> that.
> >>>>>>> I consider the integration tests/performance benchmarks to be
> >>>>>>> costly
> >>>>>>> things
> >>>>>>> that we do only for the IO transforms with large amounts of
> >>>>>>> community
> >>>>>>> support/usage. A random IO transform used by a few users doesn't
> >>>>>>> necessarily need integration & perf tests, but for heavily used IO
> >>>>>>> transforms, there's a lot of community value in these tests. The
> >>>>>>> maintenance proposal below scales with the amount of community
> >>>>>>> support
> >>>>>>> for
> >>>>>>> a particular IO transform.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Reusing data stores ("use the data stores across executions.")
> >>>>>>> ------------------
> >>>>>>> Summary: I favor a hybrid approach: some frequently used, very
> >>>>>>> small
> >>>>>>> instances that we keep up all the time + larger multi-container
> >>>>>>> data
> >>>>>>> store
> >>>>>>> instances that we spin up for perf tests.
> >>>>>>>
> >>>>>>> I don't think we need to have a strong answer to this question,
> >>>>>>> but I
> >>>>>>> think
> >>>>>>> we do need to know what range of capabilities we need, and use
> >>>>>>> that to
> >>>>>>> inform our requirements on the hosting infrastructure. I think
> >>>>>>> kubernetes/mesos + docker can support all the scenarios I discuss
> >>> below.
> >>>>>>> I had been thinking of a hybrid approach - reuse some instances and
> >>>>>>> don't reuse others. Some tests require isolation from other tests
> >>>>>>> (e.g. performance benchmarking), while others can easily re-use the
> >>>>>>> same database/data store instance over time, provided they are
> >>>>>>> written in the correct manner (e.g. simple read or write correctness
> >>>>>>> integration tests).
> >>>>>>> To me, the question of whether to use one instance over time for a
> >>>>>>> test vs
> >>>>>>> spin up an instance for each test comes down to a trade off between
> >>>>> these
> >>>>>>> factors:
> >>>>>>> 1. Flakiness of spin-up of an instance - if it's super flaky, we'll
> >>>>>>> want to
> >>>>>>> keep more instances up and running rather than bring them up/down.
> >>> (this
> >>>>>>> may also vary by the data store in question)
> >>>>>>> 2. Frequency of testing - if we are running tests every 5
> >>>>>>> minutes, it
> >>>>> may
> >>>>>>> be wasteful to bring machines up/down every time. If we run
> >>>>>>> tests once
> >>> a
> >>>>>>> day or week, it seems wasteful to keep the machines up the whole
> >>>>>>> time.
> >>>>>>> 3. Isolation requirements - If tests must be isolated, it means we
> >>>>> either
> >>>>>>> have to bring up the instances for each test, or we have to have
> >>>>>>> some
> >>>>>>> sort
> >>>>>>> of signaling mechanism to indicate that a given instance is in
> >>>>>>> use. I
> >>>>>>> strongly favor bringing up an instance per test.
> >>>>>>> 4. Number/size of containers - if we need a large number of
> >>>>>>> machines
> >>>>>>> for a
> >>>>>>> particular test, keeping them running all the time will use more
> >>>>>>> resources.
> >>>>>>>
> >>>>>>>
> >>>>>>> The major unknown to me is how flaky it'll be to spin these up. I'm
> >>>>>>> hopeful/assuming they'll be pretty stable to bring up, but I
> >>>>>>> think the
> >>>>>>> best
> >>>>>>> way to test that is to start doing it.
> >>>>>>>
> >>>>>>> I suspect the sweet spot is the following: have a set of very small
> >>> data
> >>>>>>> store instances that stay up to support small-data-size post-commit
> >>>>>>> end to
> >>>>>>> end tests (post-commits run frequently and the data size means the
> >>>>>>> instances would not use many resources), combined with the
> >>>>>>> ability to
> >>>>>>> spin
> >>>>>>> up larger instances for once a day/week performance benchmarks
> >>>>>>> (these
> >>>>> use
> >>>>>>> up more resources and are used less frequently.) That's the mix
> >>>>>>> I'll
> >>>>>>> propose in my docs on testing IO transforms.  If spinning up new
> >>>>>>> instances
> >>>>>>> is cheap/non-flaky, I'd be fine with the idea of spinning up
> >>>>>>> instances
> >>>>>>> for
> >>>>>>> each test.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Management ("what's the overhead of managing such a deployment")
> >>>>>>> --------------------
> >>>>>>> Summary: I propose that anyone can contribute scripts for
> >>>>>>> setting up
> >>>>> data
> >>>>>>> store instances + integration/perf tests, but if the community
> >>>>>>> doesn't
> >>>>>>> maintain a particular data store's tests, we disable the tests and
> >>>>>>> turn off
> >>>>>>> the data store instances.
> >>>>>>>
> >>>>>>> Management of these instances is a crucial question. First, let's
> >> break
> >>>>>>> down what tasks we'll need to do on a recurring basis:
> >>>>>>> 1. Ongoing maintenance (update to new versions, both instance &
> >>>>>>> dependencies) - we don't want to have a lot of old versions
> >>>>>>> vulnerable
> >>>>> to
> >>>>>>> attacks/buggy
> >>>>>>> 2. Investigate breakages/regressions
> >>>>>>> (I'm betting there will be more things we'll discover - let me
> >>>>>>> know if
> >>>>>>> you
> >>>>>>> have suggestions)
> >>>>>>>
> >>>>>>> There's a couple goals I see:
> >>>>>>> 1. We should only do sys admin work for things that give us a
> >>>>>>> lot of
> >>>>>>> benefit. (ie, don't build IT/perf/data store set up scripts for
> >>>>>>> data
> >>>>>>> stores
> >>>>>>> without a large community)
> >>>>>>> 2. We should do as much as possible of testing via
> >>>>>>> in-memory/embedded
> >>>>>>> testing (as you brought up).
> >>>>>>> 3. Reduce the amount of manual administration overhead
> >>>>>>>
> >>>>>>> As I discussed above, I think that integration tests/performance
> >>>>>>> benchmarks
> >>>>>>> are costly things that we should do only for the IO transforms with
> >>>>> large
> >>>>>>> amounts of community support/usage. Thus, I propose that we
> >>>>>>> limit the
> >>> IO
> >>>>>>> transforms that get integration tests & performance benchmarks to
> >> those
> >>>>>>> that have community support for maintaining the data store
> >>>>>>> instances.
> >>>>>>>
> >>>>>>> We can enforce this organically using some simple rules:
> >>>>>>> 1. Investigating breakages/regressions: if a given integration/perf
> >>> test
> >>>>>>> starts failing and no one investigates it within a set period of
> >>>>>>> time
> >>> (a
> >>>>>>> week?), we disable the tests and shut off the data store
> >>>>>>> instances if
> >>> we
> >>>>>>> have instances running. When someone wants to step up and
> >>>>>>> support it
> >>>>>>> again,
> >>>>>>> they can fix the test, check it in, and re-enable the test.
> >>>>>>> 2. Ongoing maintenance: every N months, file a jira issue that
> >>>>>>> is just
> >>>>>>> "is
> >>>>>>> the IO Transform X data store up to date?" - if the jira is not
> >>>>>>> resolved in
> >>>>>>> a set period of time (1 month?), the perf/integration tests are
> >>>>> disabled,
> >>>>>>> and the data store instances shut off.
> >>>>>>>
> >>>>>>> This is pretty flexible -
> >>>>>>> * If a particular person or organization wants to support an IO
> >>>>>>> transform,
> >>>>>>> they can. If a group of people all organically organize to keep the
> >>>>> tests
> >>>>>>> running, they can.
> >>>>>>> * It can be mostly automated - there's not a lot of central
> >>>>>>> organizing
> >>>>>>> work
> >>>>>>> that needs to be done.
> >>>>>>>
> >>>>>>> Exposing the information about what IO transforms currently have
> >>> running
> >>>>>>> IT/perf benchmarks on the website will let users know what IO
> >>> transforms
> >>>>>>> are well supported.
> >>>>>>>
> >>>>>>> I like this solution, but I also recognize this is a tricky
> >>>>>>> problem.
> >>>>> This
> >>>>>>> is something the community needs to be supportive of, so I'm
> >>>>>>> open to
> >>>>>>> other
> >>>>>>> thoughts.
> >>>>>>>
> >>>>>>>
> >>>>>>> Simulating failures in real nodes ("programmatic tests to simulate
> >>>>>>> failure")
> >>>>>>> -----------------
> >>>>>>> Summary: 1) Focus our testing on the code in Beam 2) We should
> >>>>>>> encourage a
> >>>>>>> design pattern separating out network/retry logic from the main IO
> >>>>>>> transform logic
> >>>>>>>
> >>>>>>> We *could* create instance failure in any container management
> >> software
> >>>>> -
> >>>>>>> we can use their programmatic APIs to determine which containers
> >>>>>>> are
> >>>>>>> running the instances, and ask them to kill the container in
> >>>>>>> question.
> >>> A
> >>>>>>> slow node would be trickier, but I'm sure we could figure it out
> >>>>>>> - for
> >>>>>>> example, add a network proxy that would delay responses.
> >>>>>>>
> >>>>>>> However, I would argue that this type of testing doesn't gain us a
> >>>>>>> lot, and
> >>>>>>> is complicated to set up. I think it will be easier to test network
> >>>>>>> errors
> >>>>>>> and retry behavior in unit tests for the IO transforms.
> >>>>>>>
> >>>>>>> Part of the way to handle this is to separate out the read code
> >>>>>>> from the network code (e.g. Bigtable has BigtableService). If you
> >>>>>>> put the "handle errors/retry logic" code in a separate
> >>>>>>> MySourceService class, you can test MySourceService on a wide
> >>>>>>> variety of network errors/data store problems, and then your main
> >>>>>>> IO transform tests focus on the read behavior and handling the
> >>>>>>> small set of errors the MySourceService class will return.
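> >>>>>>>
> >>>>>>> As a rough sketch of that separation (MySourceService is the
> >>>>>>> hypothetical name used above; the method and types here are
> >>>>>>> invented too):
> >>>>>>>
> >>>>>>>   import java.io.IOException;
> >>>>>>>   import java.util.List;
> >>>>>>>
> >>>>>>>   // Owns connections, retries and error translation, and exposes
> >>>>>>>   // only a small, well-defined set of failures to the IO transform.
> >>>>>>>   public interface MySourceService {
> >>>>>>>     List<String> read(String query) throws IOException;
> >>>>>>>   }
> >>>>>>>
> >>>>>>> The IO transform is then written against this interface, so its
> >>>>>>> tests can use a fake MySourceService that simulates all those
> >>>>>>> failures without any real network involved.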
> >>>>>>>
> >>>>>>> I also think we should focus on testing the IO Transform, not
> >>>>>>> the data
> >>>>>>> store - if we kill a node in a data store, it's that data store's
> >>>>>>> problem,
> >>>>>>> not beam's problem. As you were pointing out, there are a *large*
> >>>>>>> number of
> >>>>>>> possible ways that a particular data store can fail, and we
> >>>>>>> would like
> >>>>> to
> >>>>>>> support many different data stores. Rather than try to test that
> >>>>>>> each
> >>>>>>> data
> >>>>>>> store behaves well, we should ensure that we handle
> >>>>>>> generic/expected
> >>>>>>> errors
> >>>>>>> in a graceful manner.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Ismaël had a couple of other quick comments/questions; I'll answer
> >>>>>>> them here -
> >>>>>>> We can use this to test other runners running on multiple
> >>>>>>> machines - I
> >>>>>>> agree. This is also necessary for a good performance benchmark
> >>>>>>> test.
> >>>>>>>
> >>>>>>> "providing the test machines to mount the cluster" - we can discuss
> >>> this
> >>>>>>> further, but one possible option is that google may be willing to
> >>> donate
> >>>>>>> something to support this.
> >>>>>>>
> >>>>>>> "IO Consistency" - let's follow up on those questions in another
> >>> thread.
> >>>>>>> That's as much about the public interface we provide to users as
> >>>>> anything
> >>>>>>> else. I agree with your sentiment that a user should be able to
> >>>>>>> expect
> >>>>>>> predictable behavior from the different IO transforms.
> >>>>>>>
> >>>>>>> Thanks for everyone's questions/comments - I really am excited
> >>>>>>> to see
> >>>>>>> that
> >>>>>>> people care about this :)
> >>>>>>>
> >>>>>>> Stephen
> >>>>>>>
> >>>>>>> On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía <iemejia@gmail.com>
> >> wrote:
> >>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> @Stephen Thanks for your proposal, it is really interesting, I
> >>>>>>>> would
> >>>>>>>> really
> >>>>>>>> like to help with this. I have never played with Kubernetes but
> >>>>>>>> this
> >>>>>>>> seems
> >>>>>>>> a really nice chance to do something useful with it.
> >>>>>>>>
> >>>>>>>> We (at Talend) are testing most of the IOs using simple container
> >>>>> images
> >>>>>>>> and in some particular cases ‘clusters’ of containers using
> >>>>>>>> docker-compose
> >>>>>>>> (a little bit like Amit’s (2) proposal). It would be really
> >>>>>>>> nice to
> >>>>> have
> >>>>>>>> this at the Beam level, in particular to try to test more complex
> >>>>>>>> semantics. I don’t know how programmable Kubernetes is to achieve
> >>>>>>>> this; for example:
> >>>>>>>>
> >>>>>>>> Let’s say we have a cluster of Cassandra or Kafka nodes. I would
> >>>>>>>> like to
> >>>>>>>> have programmatic tests to simulate failure (e.g. kill a node), or
> >>>>>>>> simulate
> >>>>>>>> a really slow node, to ensure that the IO behaves as expected
> >>>>>>>> in the
> >>>>>>>> Beam
> >>>>>>>> pipeline for the given runner.
> >>>>>>>>
> >>>>>>>> Another related idea is to improve IO consistency: today the
> >>>>>>>> different IOs have small differences in their failure behavior,
> >>>>>>>> and I would really like to be able to predict with more precision
> >>>>>>>> what will happen in case of errors. E.g. what is the correct
> >>>>>>>> behavior if I am writing to a Kafka node and there is a network
> >>>>>>>> partition: does the Kafka sink retry or not? And what if it is the
> >>>>>>>> JdbcIO, will it work the same, e.g. assuming checkpointing? Or do
> >>>>>>>> we guarantee exactly-once writes somehow? Today I am not sure what
> >>>>>>>> happens (or whether the expected behavior depends on the runner),
> >>>>>>>> but maybe it is just that I don’t know and we have tests to ensure
> >>>>>>>> this.
> >>>>>>>>
> >>>>>>>> Of course both are really hard problems, but I think with your
> >>>>>>>> proposal we can try to tackle them, as well as the performance
> >>>>>>>> ones. And apart from the data stores, I think it would also be
> >>>>>>>> really nice to be able to test the runners in a distributed manner.
> >>>>>>>>
> >>>>>>>> So what is the next step? How do you imagine such integration
> >>>>>>>> tests? Who can provide the test machines so we can mount the
> >>>>>>>> cluster?
> >>>>>>>>
> >>>>>>>> Maybe my ideas are a bit too far away for an initial setup, but it
> >>>>>>>> will be
> >>>>>>>> really nice to start working on this.
> >>>>>>>>
> >>>>>>>> Ismaël
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela <amitsela33@gmail.com
> >
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Stephen,
> >>>>>>>>>
> >>>>>>>>> I was wondering about how we plan to use the data stores across
> >>>>>>>> executions.
> >>>>>>>>> Clearly, it's best to set up a new instance (container) for every
> >>>>>>>>> test, running a "standalone" store (say HBase/Cassandra, for
> >>>>>>>>> example), and once the test is done, tear down the instance. It
> >>>>>>>>> should also be agnostic to the runtime environment (e.g., Docker
> >>>>>>>>> on Kubernetes).
> >>>>>>>>> I'm wondering though what's the overhead of managing such a
> >>> deployment
> >>>>>>>>> which could become heavy and complicated as more IOs are
> >>>>>>>>> supported
> >>> and
> >>>>>>>> more
> >>>>>>>>> test cases introduced.
> >>>>>>>>>
> >>>>>>>>> Another way to go would be to have small clusters of different
> >>>>>>>>> data
> >>>>>>>> stores
> >>>>>>>>> and run against new "namespaces" (while lazily evicting old
> >>>>>>>>> ones),
> >>>>>>>>> but I
> >>>>>>>>> think this is less likely as maintaining a distributed instance
> >> (even
> >>>>> a
> >>>>>>>>> small one) for each data store sounds even more complex.
> >>>>>>>>>
> >>>>>>>>> A third approach would be to simply have an "embedded" in-memory
> >>>>>>>>> instance of a data store as part of a test that runs against it
> >>>>>>>>> (such as an embedded Kafka, though not a data store).
> >>>>>>>>> This is probably the simplest solution in terms of orchestration,
> >>>>>>>>> but it
> >>>>>>>>> depends on having a proper "embedded" implementation for an IO.
> >>>>>>>>>
> >>>>>>>>> Does this make sense to you ? have you considered it ?
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Amit
> >>>>>>>>>
> >>>>>>>>> On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré <
> >> jb@nanthrax.net
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Stephen,
> >>>>>>>>>>
> >>>>>>>>>> as we already discussed a bit together, it sounds great! In
> >>>>>>>>>> particular, I like it as both an integration test platform and
> >>>>>>>>>> good coverage for IOs.
> >>>>>>>>>>
> >>>>>>>>>> I'm very late on this but, as said, I will share with you my
> >>>>>>>>>> Marathon JSON and Mesos Docker images.
> >>>>>>>>>>
> >>>>>>>>>> By the way, I started to experiment a bit with Kubernetes and
> >>>>>>>>>> Swarm, but it's not yet complete. I will share what I have on the
> >>>>>>>>>> same GitHub repo.
> >>>>>>>>>>
> >>>>>>>>>> Thanks !
> >>>>>>>>>> Regards
> >>>>>>>>>> JB
> >>>>>>>>>>
> >>>>>>>>>> On 11/16/2016 11:36 PM, Stephen Sisk wrote:
> >>>>>>>>>>> Hi everyone!
> >>>>>>>>>>>
> >>>>>>>>>>> Currently we have a good set of unit tests for our IO
> >>>>>>>>>>> Transforms -
> >>>>>>>>> those
> >>>>>>>>>>> tend to run against in-memory versions of the data stores.
> >> However,
> >>>>>>>>> we'd
> >>>>>>>>>>> like to further increase our test coverage to include
> >>>>>>>>>>> running them
> >>>>>>>>>> against
> >>>>>>>>>>> real instances of the data stores that the IO Transforms work
> >>>>> against
> >>>>>>>>>> (e.g.
> >>>>>>>>>>> cassandra, mongodb, kafka, etc…), which means we'll need to
> >>>>>>>>>>> have
> >>>>> real
> >>>>>>>>>>> instances of various data stores.
> >>>>>>>>>>>
> >>>>>>>>>>> Additionally, if we want to do performance regression
> >>>>>>>>>>> detection,
> >>>>> it's
> >>>>>>>>>>> important to have instances of the services that behave
> >>>>>>>> realistically,
> >>>>>>>>>>> which isn't true of in-memory or dev versions of the services.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Proposed solution
> >>>>>>>>>>> -------------------------
> >>>>>>>>>>> If we accept this proposal, we would create an
> >>>>>>>>>>> infrastructure for
> >>>>>>>>> running
> >>>>>>>>>>> real instances of data stores inside of containers, using
> >> container
> >>>>>>>>>>> management software like mesos/marathon, kubernetes, docker
> >>>>>>>>>>> swarm,
> >>>>>>>> etc…
> >>>>>>>>>> to
> >>>>>>>>>>> manage the instances.
> >>>>>>>>>>>
> >>>>>>>>>>> This would enable us to build integration tests that run
> >>>>>>>>>>> against
> >>>>>>>> those
> >>>>>>>>>> real
> >>>>>>>>>>> instances and performance tests that run against those real
> >>>>> instances
> >>>>>>>>>> (like
> >>>>>>>>>>> those that Jason Kuster is proposing elsewhere.)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Why do we need one centralized set of instances vs just having
> >>>>>>>> various
> >>>>>>>>>>> people host their own instances?
> >>>>>>>>>>> -------------------------
> >>>>>>>>>>> Reducing flakiness of tests is key. By not having the core
> >>>>>>>>>>> project depend on external services/instances of data stores,
> >>>>>>>>>>> we have guaranteed access to the services, and the group can
> >>>>>>>>>>> fix issues that arise.
> >>>>>>>>>>> An exception would be something that has an ops team
> >>>>>>>>>>> supporting it
> >>>>>>>> (eg,
> >>>>>>>>>>> AWS, Google Cloud or other professionally managed service) -
> >>>>>>>>>>> those
> >>>>> we
> >>>>>>>>>> trust
> >>>>>>>>>>> will be stable.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> There may be a lot of different data stores needed - how
> >>>>>>>>>>> will we
> >>>>>>>>> maintain
> >>>>>>>>>>> them?
> >>>>>>>>>>> -------------------------
> >>>>>>>>>>> It will take work above and beyond that of a normal set of unit
> >>>>> tests
> >>>>>>>>> to
> >>>>>>>>>>> build and maintain integration/performance tests & their data
> >> store
> >>>>>>>>>>> instances.
> >>>>>>>>>>>
> >>>>>>>>>>> Setup & maintenance of the data store containers and the data
> >>>>>>>>>>> store instances on them must be automated. It also has to be as
> >>>>>>>>>>> simple a setup as possible, and we should avoid hand-tweaking
> >>>>>>>>>>> the containers - expecting checked-in scripts/Dockerfiles is key.
> >>>>>>>>>>>
> >>>>>>>>>>> Aligned with the community ownership approach of Apache, as
> >> members
> >>>>>>>> of
> >>>>>>>>>> the
> >>>>>>>>>>> community are excited to contribute & maintain those tests
> >>>>>>>>>>> and the
> >>>>>>>>>>> integration/performance tests, people will be able to step
> >>>>>>>>>>> up and
> >>> do
> >>>>>>>>>> that.
> >>>>>>>>>>> If there is no longer support for maintaining a particular
> >>>>>>>>>>> set of
> >>>>>>>>>>> integration & performance tests and their data store instances,
> >>> then
> >>>>>>>> we
> >>>>>>>>>> can
> >>>>>>>>>>> disable those tests. We may document on the website what IO
> >>>>>>>> Transforms
> >>>>>>>>>> have
> >>>>>>>>>>> current integration/performance tests so users know what
> >>>>>>>>>>> level of
> >>>>>>>>> testing
> >>>>>>>>>>> the various IO Transforms have.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> What about requirements for the container management software
> >>>>> itself?
> >>>>>>>>>>> -------------------------
> >>>>>>>>>>> * We should have the data store instances themselves in Docker.
> >>>>>>>> Docker
> >>>>>>>>>>> allows new instances to be spun up in a quick, reproducible way
> >> and
> >>>>>>>> is
> >>>>>>>>>>> fairly platform independent. It has wide support from a
> >>>>>>>>>>> variety of
> >>>>>>>>>>> different container management services.
> >>>>>>>>>>> * As little admin work required as possible. Crashing instances
> >>>>>>>> should
> >>>>>>>>> be
> >>>>>>>>>>> restarted, setup should be simple, everything possible
> >>>>>>>>>>> should be
> >>>>>>>>>>> scripted/scriptable.
> >>>>>>>>>>> * Logs and test output should be on a publicly available
> >>>>>>>>>>> website,
> >>>>>>>>> without
> >>>>>>>>>>> needing to log into test execution machine. Centralized
> >>>>>>>>>>> capture of
> >>>>>>>>>>> monitoring info/logs from instances running in the containers
> >> would
> >>>>>>>>>> support
> >>>>>>>>>>> this. Ideally, this would just be supported by the container
> >>>>> software
> >>>>>>>>> out
> >>>>>>>>>>> of the box.
> >>>>>>>>>>> * It'd be useful to have good persistent volume in the
> >>>>>>>>>>> container
> >>>>>>>>>> management
> >>>>>>>>>>> software so that databases don't have to reload large data sets
> >>>>> every
> >>>>>>>>>> time.
> >>>>>>>>>>> * The containers may be a place to execute runners
> >>>>>>>>>>> themselves if
> >> we
> >>>>>>>>> need
> >>>>>>>>>>> larger runner instances, so it should play well with Spark,
> >>>>>>>>>>> Flink,
> >>>>>>>> etc…
> >>>>>>>>>>> As I discussed earlier on the mailing list, it looks like
> >>>>>>>>>>> hosting
> >>>>>>>>> docker
> >>>>>>>>>>> containers on kubernetes, docker swarm or mesos+marathon
> >>>>>>>>>>> would be
> >> a
> >>>>>>>>> good
> >>>>>>>>>>> solution.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Stephen Sisk
> >>>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Jean-Baptiste Onofré
> >>>>>>>>>> jbonofre@apache.org
> >>>>>>>>>> http://blog.nanthrax.net
> >>>>>>>>>> Talend - http://www.talend.com
> >>>>>>>>>>
> >>>>> --
> >>>>> Jean-Baptiste Onofré
> >>>>> jbonofre@apache.org
> >>>>> http://blog.nanthrax.net
> >>>>> Talend - http://www.talend.com
> >>>>>
> >>
> >
>
>
