flink-dev mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: Current master broken?
Date Mon, 16 Mar 2015 08:04:41 GMT
The reason is probably that Travis used different VMs then than it does now.
The test did not specify the parallelism itself, but picked it up from the
hardware. When the VMs changed, the parallelism changed (it increased a lot),
to the point where the default environment has no more network buffers.

A simple fix is to either use one of the test bases or to specify the
parallelism directly.
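The buffer shortfall can be sketched numerically. As an illustrative assumption (not Flink's exact accounting), suppose each subtask of an all-to-all data exchange holds one network buffer per channel, so one exchange at parallelism p needs roughly p * p buffers on a single machine. Demand then grows quadratically with the core count the environment picks up:

```java
public class NetworkBufferMath {

    // Illustrative assumption: one buffer per channel per subtask, so a
    // single all-to-all exchange at parallelism p needs about p * p buffers.
    static long buffersNeeded(int parallelism, int exchanges) {
        return (long) parallelism * parallelism * exchanges;
    }

    public static void main(String[] args) {
        // The default pool in the failing builds held 2048 buffers.
        System.out.println(buffersNeeded(4, 2));   // 32   -> fits easily
        System.out.println(buffersNeeded(32, 2));  // 2048 -> pool exhausted
    }
}
```

Under this (hypothetical) model, a test that ran fine at parallelism 4 consumes the entire default pool once the VM reports 32 cores, which matches the failures below.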
On 15.03.2015 19:24, "Vasiliki Kalavri" <vasilikikalavri@gmail.com> wrote:

> Hi,
>
> thanks a lot for fixing this so quickly!
>
> Just to make sure I fully understand what happened, how come the travis
> build was successful when we merged this, but failed later?
> Is there a way to avoid such issues in the future?
>
> Cheers,
> Vasia.
>
>
> On 15 March 2015 at 17:07, Stephan Ewen <sewen@apache.org> wrote:
>
> > Waiting for travis to give me the green light, then I'll push the fix...
> >
> > On Sun, Mar 15, 2015 at 5:04 PM, Robert Metzger <rmetzger@apache.org>
> > wrote:
> >
> > > I think the issue is that our tests are executed on Travis machines
> > > with different physical CPU core counts.
> > >
> > > I've pushed a 5-day-old commit
> > > (https://github.com/rmetzger/flink/commit/b4e8350f52c81704ffc726a1689bb0dc7180776d)
> > > to Travis, and it also failed with that issue:
> > > https://travis-ci.org/rmetzger/flink/builds/54443951
> > >
> > > Thanks for resolving the issue so quickly Stephan!
> > >
> > > On Sun, Mar 15, 2015 at 4:06 PM, Andra Lungu <lungu.andra@gmail.com>
> > > wrote:
> > >
> > > > Hi Stephan,
> > > >
> > > > The degree of parallelism was manually set there.
> > > > MultipleProgramsTestBase cannot be extended; Ufuk explained why.
> > > >
> > > > But I see that for the latest travis check, that test passed.
> > > > https://github.com/apache/flink/pull/475
> > > >
> > > > On Sun, Mar 15, 2015 at 3:54 PM, Stephan Ewen <sewen@apache.org>
> > wrote:
> > > >
> > > > > Cause of the failures:
> > > > >
> > > > > The tests in DegreesWithExceptionITCase use the context execution
> > > > > environment without extending a test base. This context environment
> > > > > instantiates a local execution environment with a parallelism equal
> > > > > to the number of cores. Since on Travis, builds run in containers on
> > > > > big machines, the number of cores may be very high (32/64) - this
> > > > > causes the tests to run out of network buffers with the default
> > > > > configuration.
> > > > >
> > > > >
> > > > > IMPORTANT: Please make sure that all tests in the future either use
> > > > > one of the test base classes (which define a reasonable parallelism)
> > > > > or define the parallelism manually, to be safe!
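For completeness, the configuration key named in the exceptions below can also be raised in flink-conf.yaml. The value 4096 is only an illustrative guess, not a recommendation from this thread; it should be sized to the actual parallelism in use:

```yaml
# flink-conf.yaml -- illustrative value only
taskmanager.network.numberOfBuffers: 4096
```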
> > > > >
> > > > > On Sun, Mar 15, 2015 at 3:43 PM, Stephan Ewen <sewen@apache.org>
> > > wrote:
> > > > >
> > > > > > It seems that the current master is broken with respect to the
> > > > > > tests.
> > > > > >
> > > > > > I see all builds on Travis consistently failing, in the gelly
> > > > > > project. Since Travis is a bit behind in the "apache" account, I
> > > > > > triggered a build in my own account. The hash is the same; it
> > > > > > should contain the master from yesterday.
> > > > > >
> > > > > > https://travis-ci.org/StephanEwen/incubator-flink/builds/54386416
> > > > > >
> > > > > > In all executions it results in the stack trace below. I cannot
> > > > > > reproduce the problem locally, unfortunately.
> > > > > >
> > > > > > This is a serious issue; it totally kills the testability.
> > > > > >
> > > > > > Results :
> > > > > >
> > > > > > Failed tests:
> > > > > >   DegreesWithExceptionITCase.testGetDegreesInvalidEdgeSrcId:113
> > > > > > expected:<[The edge src/trg id could not be found within the vertexIds]>
> > > > > > but was:<[Failed to deploy the task Reduce(SUM(1), at
> > > > > > getDegrees(Graph.java:664) (30/32) - execution #0 to slot
> > > > > > SimpleSlot (2)(2) - 31624115d75feb2c387ae9043021d8e6 - ALLOCATED/ALIVE:
> > > > > > java.io.IOException: Insufficient number of network buffers:
> > > > > > required 32, but only 2 available. The total number of network
> > > > > > buffers is currently set to 2048. You can increase this number by
> > > > > > setting the configuration key 'taskmanager.network.numberOfBuffers'.
> > > > > >       at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:158)
> > > > > >       at org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:163)
> > > > > >       at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$submitTask(TaskManager.scala:454)
> > > > > >       at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$receiveWithLogMessages$1.applyOrElse(TaskManager.scala:237)
> > > > > >       at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> > > > > >       at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> > > > > >       at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> > > > > >       at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
> > > > > >       at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
> > > > > >       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> > > > > >       at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
> > > > > >       at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> > > > > >       at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:91)
> > > > > >       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> > > > > >       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> > > > > >       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> > > > > >       at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> > > > > >       at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> > > > > >       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > > > > >       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
> > > > > >       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
> > > > > >       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > > > > >       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> > > > > > ]>
> > > > > >   DegreesWithExceptionITCase.testGetDegreesInvalidEdgeTrgId:92
> > > > > > expected:<[The edge src/trg id could not be found within the vertexIds]>
> > > > > > but was:<[Failed to deploy the task CoGroup (CoGroup at
> > > > > > inDegrees(Graph.java:655)) (29/32) - execution #0 to slot
> > > > > > SimpleSlot (1)(3) - 1735ca6f2fb76f9f0a0ab03ffd9c9f93 - ALLOCATED/ALIVE:
> > > > > > java.io.IOException: Insufficient number of network buffers:
> > > > > > required 32, but only 8 available. The total number of network
> > > > > > buffers is currently set to 2048. You can increase this number by
> > > > > > setting the configuration key 'taskmanager.network.numberOfBuffers'.
> > > > > >       at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:158)
> > > > > >       at org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:135)
> > > > > >       at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$submitTask(TaskManager.scala:454)
> > > > > >       at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$receiveWithLogMessages$1.applyOrElse(TaskManager.scala:237)
> > > > > >       at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> > > > > >       at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> > > > > >       at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> > > > > >       at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
> > > > > >       at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
> > > > > >       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> > > > > >       at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
> > > > > >       at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> > > > > >       at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:91)
> > > > > >       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> > > > > >       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> > > > > >       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> > > > > >       at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> > > > > >       at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> > > > > >       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > > > > >       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
> > > > > >       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
> > > > > >       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > > > > >       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> > > > > > ]>
> > > > > >   DegreesWithExceptionITCase.testGetDegreesInvalidEdgeSrcTrgId:134
> > > > > > expected:<[The edge src/trg id could not be found within the vertexIds]>
> > > > > > but was:<[Failed to deploy the task CoGroup (CoGroup at
> > > > > > inDegrees(Graph.java:655)) (31/32) - execution #0 to slot
> > > > > > SimpleSlot (1)(3) - 3a465bdbeca9625e5d90572ed0959b1d - ALLOCATED/ALIVE:
> > > > > > java.io.IOException: Insufficient number of network buffers:
> > > > > > required 32, but only 8 available. The total number of network
> > > > > > buffers is currently set to 2048. You can increase this number by
> > > > > > setting the configuration key 'taskmanager.network.numberOfBuffers'.
> > > > > >       at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:158)
> > > > > >       at org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:135)
> > > > > >       at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$submitTask(TaskManager.scala:454)
> > > > > >       at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$receiveWithLogMessages$1.applyOrElse(TaskManager.scala:237)
> > > > > >       at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> > > > > >       at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> > > > > >       at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> > > > > >       at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
> > > > > >       at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
> > > > > >       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> > > > > >       at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
> > > > > >       at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> > > > > >       at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:91)
> > > > > >       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> > > > > >       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> > > > > >       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> > > > > >       at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> > > > > >       at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> > > > > >       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > > > > >       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
> > > > > >       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
> > > > > >       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > > > > >       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> > > > > > ]>
> > > > > >
> > > > > > Tests run: 180, Failures: 3, Errors: 0, Skipped: 0
> > > > > >
> > > > > > [INFO]
> > > > > > [INFO] --- maven-failsafe-plugin:2.17:verify (default) @ flink-gelly ---
> > > > > > [INFO] Failsafe report directory: /home/travis/build/StephanEwen/incubator-flink/flink-staging/flink-gelly/target/failsafe-reports
> > > > > > [INFO] ------------------------------------------------------------------------
> > > > > > [INFO] Reactor Summary:
> > > > > > [INFO]
> > > > > > [INFO] flink .............................................. SUCCESS [  6.075 s]
> > > > > > [INFO] flink-shaded-hadoop ................................ SUCCESS [  1.827 s]
> > > > > > [INFO] flink-shaded-hadoop1 ............................... SUCCESS [  7.384 s]
> > > > > > [INFO] flink-core ......................................... SUCCESS [ 37.973 s]
> > > > > > [INFO] flink-java ......................................... SUCCESS [ 17.373 s]
> > > > > > [INFO] flink-runtime ...................................... SUCCESS [11:13 min]
> > > > > > [INFO] flink-compiler ..................................... SUCCESS [  7.149 s]
> > > > > > [INFO] flink-clients ...................................... SUCCESS [  9.130 s]
> > > > > > [INFO] flink-test-utils ................................... SUCCESS [  8.519 s]
> > > > > > [INFO] flink-scala ........................................ SUCCESS [ 36.171 s]
> > > > > > [INFO] flink-examples ..................................... SUCCESS [  0.370 s]
> > > > > > [INFO] flink-java-examples ................................ SUCCESS [  2.335 s]
> > > > > > [INFO] flink-scala-examples ............................... SUCCESS [ 25.139 s]
> > > > > > [INFO] flink-staging ...................................... SUCCESS [  0.093 s]
> > > > > > [INFO] flink-streaming .................................... SUCCESS [  0.315 s]
> > > > > > [INFO] flink-streaming-core ............................... SUCCESS [  9.560 s]
> > > > > > [INFO] flink-tests ........................................ SUCCESS [09:11 min]
> > > > > > [INFO] flink-avro ......................................... SUCCESS [ 17.307 s]
> > > > > > [INFO] flink-jdbc ......................................... SUCCESS [  3.715 s]
> > > > > > [INFO] flink-spargel ...................................... SUCCESS [  7.141 s]
> > > > > > [INFO] flink-hadoop-compatibility ......................... SUCCESS [ 19.508 s]
> > > > > > [INFO] flink-streaming-scala .............................. SUCCESS [ 14.936 s]
> > > > > > [INFO] flink-streaming-connectors ......................... SUCCESS [  2.784 s]
> > > > > > [INFO] flink-streaming-examples ........................... SUCCESS [ 18.787 s]
> > > > > > [INFO] flink-hbase ........................................ SUCCESS [  2.870 s]
> > > > > > [INFO] flink-gelly ........................................ FAILURE [ 58.548 s]
> > > > > > [INFO] flink-hcatalog ..................................... SKIPPED
> > > > > > [INFO] flink-expressions .................................. SKIPPED
> > > > > > [INFO] flink-quickstart ................................... SKIPPED
> > > > > > [INFO] flink-quickstart-java .............................. SKIPPED
> > > > > > [INFO] flink-quickstart-scala ............................. SKIPPED
> > > > > > [INFO] flink-contrib ...................................... SKIPPED
> > > > > > [INFO] flink-dist ......................................... SKIPPED
> > > > > > [INFO] ------------------------------------------------------------------------
> > > > > > [INFO] BUILD FAILURE
> > > > > > [INFO] ------------------------------------------------------------------------
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
