flink-dev mailing list archives

From Chiwan Park <chiwanp...@apache.org>
Subject Re: [ANNOUNCE] Build Issues Solved
Date Tue, 31 May 2016 09:53:45 GMT
I’ve created a JIRA issue [1] related to KNN test cases. I will send a PR for it.

From my investigation [2], the cluster for the ML tests has only one taskmanager with 4 slots.
Is 2048 insufficient for the total number of network buffers? I still think the problem is
sharing the ExecutionEnvironment between test cases.

[1]: https://issues.apache.org/jira/browse/FLINK-3994
[2]: https://github.com/apache/flink/blob/master/flink-test-utils/src/test/scala/org/apache/flink/test/util/FlinkTestBase.scala#L56

Regards,
Chiwan Park

> On May 31, 2016, at 6:05 PM, Maximilian Michels <mxm@apache.org> wrote:
> 
> Thanks Stephan for the synopsis of the last weeks' test instability
> madness. It's sad to see the shortcomings of Maven test plugins, but
> another lesson learned is that our testing infrastructure should get a
> bit more attention. We have reached a point several times where our
> tests were inherently unstable. Now we saw that even more problems
> were hidden in the dark. I would like to see more maintenance
> dedicated to testing.
> 
> @Chiwan: Please, no hotfix! Please open a JIRA issue and a pull
> request with a systematic fix. Those things are too crucial to be
> fixed on the go. The problem is that Travis reports the number of
> processors to be "32" (which is used for the number of task slots in
> local execution). The network buffers are not adjusted accordingly. We
> should set them correctly in the MiniCluster. Also, we could define an
> upper limit to the number of task slots for tests.
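
The mismatch described above can be made concrete with a small sketch. It is not Flink code. It assumes the rule of thumb from the Flink configuration documentation of that period for `taskmanager.network.numberOfBuffers` (slots per TaskManager squared, times the number of TaskManagers, times 4). The function names and the cap of 4 slots are hypothetical and only illustrate the "upper limit" idea.

```python
import os

DEFAULT_NETWORK_BUFFERS = 2048  # the buffer count mentioned in this thread
MAX_TEST_SLOTS = 4              # hypothetical upper limit for test clusters


def required_network_buffers(slots_per_tm, num_tms=1):
    """Rule-of-thumb estimate (assumed): slots-per-TM^2 * #TMs * 4."""
    return slots_per_tm ** 2 * num_tms * 4


def capped_test_slots(reported_cpus, limit=MAX_TEST_SLOTS):
    """Derive the slot count from the CPU count, but never exceed the limit."""
    return max(1, min(reported_cpus, limit))


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    for slots in (4, 32, capped_test_slots(cpus)):
        needed = required_network_buffers(slots)
        verdict = "fits in" if needed <= DEFAULT_NETWORK_BUFFERS else "exceeds"
        print(f"{slots} slots need ~{needed} buffers ({verdict} {DEFAULT_NETWORK_BUFFERS})")
```

Under this estimate, one TaskManager with 4 slots needs about 64 buffers, while 32 slots need about 4096, which is more than 2048. Capping the slot count for tests or scaling the buffers with the slot count would each close that gap.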
> 
> On Tue, May 31, 2016 at 10:59 AM, Chiwan Park <chiwanpark@apache.org> wrote:
>> I think that the tests fail because of sharing the ExecutionEnvironment between
>> test cases. I’m not sure why it is a problem, but it is the only difference from
>> the other ML tests.
>> 
>> I created a hotfix and pushed it to my repository. Once it seems fixed [1], I’ll
>> merge the hotfix into the master branch.
>> 
>> [1]: https://travis-ci.org/chiwanpark/flink/builds/134104491
>> 
>> Regards,
>> Chiwan Park
>> 
>>> On May 31, 2016, at 5:43 PM, Chiwan Park <chiwanpark@apache.org> wrote:
>>> 
>>> It seems to be related to the KNN test case that was merged yesterday. I’ll look
>>> into the ML test.
>>> 
>>> Regards,
>>> Chiwan Park
>>> 
>>>> On May 31, 2016, at 5:38 PM, Ufuk Celebi <uce@apache.org> wrote:
>>>> 
>>>> Currently, an ML test is reliably failing and occasionally some HA
>>>> tests. Is someone looking into the ML test?
>>>> 
>>>> For HA, I will revert a commit, which might cause the HA
>>>> instabilities. Till is working on a proper fix as far as I know.
>>>> 
>>>> On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanpark@apache.org> wrote:
>>>>> Thanks for the great work! :-)
>>>>> 
>>>>> Regards,
>>>>> Chiwan Park
>>>>> 
>>>>>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>>>>>> 
>>>>>> Awesome work guys!
>>>>>> And even more thanks for the detailed report... This troubleshooting summary
>>>>>> will undoubtedly be useful for all our Maven projects!
>>>>>> 
>>>>>> Best,
>>>>>> Flavio
>>>>>> On 30 May 2016 23:47, "Ufuk Celebi" <uce@apache.org> wrote:
>>>>>> 
>>>>>>> Thanks for the effort, Max and Stephan! Happy to see the green light again.
>>>>>>> 
>>>>>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <sewen@apache.org> wrote:
>>>>>>>> Hi all!
>>>>>>>> 
>>>>>>>> After a few weeks of terrible build issues, I am happy to announce that the
>>>>>>>> build works properly again, and we actually get meaningful CI results.
>>>>>>>> 
>>>>>>>> Here is a story in many acts, from builds deep red to bright green joy.
>>>>>>>> Kudos to Max, who did most of this troubleshooting. This evening, Max and
>>>>>>>> I debugged the final issue and got the build back on track.
>>>>>>>> 
>>>>>>>> ------------------
>>>>>>>> The Journey
>>>>>>>> ------------------
>>>>>>>> 
>>>>>>>> (1) Failsafe Plugin
>>>>>>>> 
>>>>>>>> The Maven Failsafe Build Plugin had a critical bug due to which failed
>>>>>>>> tests did not result in a failed build.
>>>>>>>> 
>>>>>>>> That is a pretty bad bug for a plugin whose only task is to run tests and
>>>>>>>> fail the build if a test fails.
>>>>>>>> 
>>>>>>>> After we recognized that, we upgraded the Failsafe Plugin.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> (2) Failsafe Plugin Dependency Issues
>>>>>>>> 
>>>>>>>> After the upgrade, the Failsafe Plugin behaved differently and did not
>>>>>>>> interoperate with Dependency Shading any more.
>>>>>>>> 
>>>>>>>> Because of that, we switched to the Surefire Plugin.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> (3) Fixing all the issues introduced in the meantime
>>>>>>>> 
>>>>>>>> Naturally, a number of test instabilities had been introduced, which needed
>>>>>>>> to be fixed.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> (4) Yarn Tests and Test Scope Refactoring
>>>>>>>> 
>>>>>>>> In the meantime, a Pull Request was merged that moved the Yarn Tests to the
>>>>>>>> test scope.
>>>>>>>> Because the configuration searched for tests in the "main" scope, no Yarn
>>>>>>>> tests were executed for a while, until the scope was fixed.
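
A run that silently executes no tests, as described above, looks just as green as a run in which everything passed. One way to catch this is a sanity check over the test reports. The sketch below is not part of the Flink build. It assumes JUnit-style XML reports such as the `TEST-*.xml` files Surefire writes, whose `<testsuite>` root carries `tests`, `failures` and `errors` attributes. The function names and the `min_tests` threshold are hypothetical.

```python
import xml.etree.ElementTree as ET


def summarize_reports(report_xmls):
    """Sum the tests/failures/errors attributes over <testsuite> roots."""
    totals = {"tests": 0, "failures": 0, "errors": 0}
    for xml_text in report_xmls:
        suite = ET.fromstring(xml_text)
        for key in totals:
            totals[key] += int(suite.get(key, "0"))
    return totals


def check_build(report_xmls, min_tests=1):
    """Return a list of problems; an empty list means the run looks sane."""
    totals = summarize_reports(report_xmls)
    problems = []
    if totals["tests"] < min_tests:
        problems.append(
            f"only {totals['tests']} tests ran, expected at least {min_tests}"
        )
    if totals["failures"] or totals["errors"]:
        problems.append(
            f"{totals['failures']} failures and {totals['errors']} errors were reported"
        )
    return problems
```

Passing an empty list of reports, which is what a module with no discovered tests would produce, yields a "only 0 tests ran" problem instead of a silent success.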
>>>>>>>> 
>>>>>>>> 
>>>>>>>> (5) Yarn Tests and JMX Metrics
>>>>>>>> 
>>>>>>>> After the Yarn Tests were re-activated, we saw them fail due to warnings
>>>>>>>> created by the newly introduced metrics code. We could fix that by updating
>>>>>>>> the metrics code and temporarily not registering JMX beans for all metrics.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> (6) Yarn / Surefire Deadlock
>>>>>>>> 
>>>>>>>> Finally, some Yarn tests failed reliably in Maven (though not in the IDE).
>>>>>>>> It turned out that those tests exercise a command line interface that
>>>>>>>> interacts with the standard input stream.
>>>>>>>> 
>>>>>>>> The newly deployed Surefire Plugin uses standard input as well, for
>>>>>>>> communication with forked JVMs. Since Surefire internally locks the
>>>>>>>> standard input stream, the Yarn CLI cannot poll the standard input stream
>>>>>>>> without locking up and stalling the tests.
>>>>>>>> 
>>>>>>>> We adjusted the tests and now the build happily builds again.
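
The message does not say how the tests were adjusted. One general way to keep a CLI test away from the process's standard input is to let the CLI read from an injected stream. The sketch below shows that pattern. It is not the actual Flink change, it is written in Python for brevity although Flink itself is Java and Scala, and every name in it is hypothetical.

```python
import io
import sys


def run_cli(input_stream=None, output_stream=None):
    """Tiny interactive loop: acknowledge commands until 'quit' or end of input."""
    inp = input_stream if input_stream is not None else sys.stdin
    out = output_stream if output_stream is not None else sys.stdout
    handled = []
    for line in inp:
        command = line.strip()
        if command == "quit":
            break
        if command:
            handled.append(command)
            out.write(f"ok: {command}\n")
    return handled


def test_cli_without_real_stdin():
    # The test feeds an in-memory stream, so the process's standard input
    # (which the test runner may be using itself) is never touched.
    fake_in = io.StringIO("help\nstatus\nquit\nignored\n")
    fake_out = io.StringIO()
    assert run_cli(fake_in, fake_out) == ["help", "status"]
    assert fake_out.getvalue() == "ok: help\nok: status\n"
```

In production the CLI falls back to the real standard input. In tests the stream is supplied by the test, so it cannot contend with a runner that uses standard input to talk to its forked processes.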
>>>>>>>> 
>>>>>>>> -----------------
>>>>>>>> Conclusions:
>>>>>>>> -----------------
>>>>>>>> 
>>>>>>>> - CI is terribly crucial. It took us weeks to deal with the fallout of
>>>>>>>> having a period of unreliable CI.
>>>>>>>> 
>>>>>>>> - Maven could do a better job. A bug as crucial as the one that started
>>>>>>>> our problem should not occur in a test plugin like Surefire. Also, the
>>>>>>>> constant change of semantics and dependency scopes is annoying. The
>>>>>>>> semantic changes are subtle, but for a build as complex as Flink, they make
>>>>>>>> a difference.
>>>>>>>> 
>>>>>>>> - File-based communication is rarely a good idea. The bug in the Failsafe
>>>>>>>> plugin was caused by improper file-based communication, and so were some
>>>>>>>> of the instabilities we discovered.
>>>>>>>> 
>>>>>>>> Greetings,
>>>>>>>> Stephan
>>>>>>>> 
>>>>>>>> 
>>>>>>>> PS: Some issues and mysteries remain for us to solve: When we allow our
>>>>>>>> metrics subsystem to register JMX beans, we see some tests failing due to
>>>>>>>> spontaneous JVM process kills. Whoever has a pointer there, please ping us!

