infra-issues mailing list archives

From "Scott Wegner (JIRA)" <>
Subject [jira] [Commented] (INFRA-16380) Nodes Failing With "Backing Channel '<slave_name>' is disconnected.
Date Wed, 18 Apr 2018 23:19:00 GMT


Scott Wegner commented on INFRA-16380:

[~cml] thanks for the details; setting up a monitoring dashboard sounds perfect. I'm not
familiar with Datadog's capabilities, but as Jason mentioned: the more data the better.

For the current task of build migration, here's what I'm grappling with: Gradle parallelizes
tasks much more aggressively than our previous build did, and it spins up so many workers
that it exhausts machine memory. We have a number of Jenkins jobs that each run a different
subset of the build and tests, so each has its own performance characteristics. I need to
find tuning parameters for the build that keep memory safely constrained without
over-throttling build performance.
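
As a rough illustration of that trade-off, the sketch below estimates a worker cap from a memory budget. Every number and name in it is a hypothetical placeholder, not a measurement from the Beam build; the result is the kind of value one might try for Gradle's `--max-workers` flag (or `org.gradle.workers.max`), with the daemon heap set separately via `org.gradle.jvmargs`.

```python
def max_gradle_workers(total_mem_mb, reserved_mb, daemon_heap_mb,
                       per_worker_mb, cpu_count):
    """Estimate a worker cap that keeps projected memory under the machine total.

    All inputs are assumptions supplied by the caller:
      total_mem_mb   - physical memory on the build node
      reserved_mb    - headroom for the OS, Jenkins agent, and other jobs
      daemon_heap_mb - heap given to the Gradle daemon itself
      per_worker_mb  - assumed peak memory of one worker process
      cpu_count      - upper bound, since more workers than cores rarely helps
    """
    available = total_mem_mb - reserved_mb - daemon_heap_mb
    if available < per_worker_mb:
        # Not enough room for even one worker at the assumed size;
        # fall back to a single worker rather than returning zero.
        return 1
    return max(1, min(cpu_count, available // per_worker_mb))


# Hypothetical node: 16 GiB RAM, 2 GiB reserved, 4 GiB daemon heap,
# 2 GiB assumed per worker, 16 cores.
workers = max_gradle_workers(16384, 2048, 4096, 2048, 16)
```

The per-worker figure is the weak point of such an estimate, which is why measured memory data per job matters more than the formula.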

For the above problem, here's an ideal workflow:
a) Kick off a particular Jenkins job
b) While it's running, view all processes and subprocesses launched as part of the job, with
command-line parameters and a timeline of current and historical memory usage
c) Look for times where memory usage peaked above some limit, and drill into what was going
on with as much correlated context as possible, such as heap dumps, Gradle task status, log
files, etc.
d) Based on investigation, tweak some build parameters and iterate until we're happy.

I realize that's probably a pipe dream, but any portion that you could enable would be amazing.
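
Absent a dashboard, steps (b) and (c) can be approximated by hand. The following is a minimal sketch, assuming snapshots are captured periodically with `ps -eo pid,ppid,rss,args` (RSS reported in KiB) and that the PID of the job's top-level process is known; the function names are made up for illustration and are not part of any Jenkins or Datadog API.

```python
from collections import defaultdict


def parse_ps(text):
    """Parse `ps -eo pid,ppid,rss,args` output into {pid: (ppid, rss_kb, args)}."""
    procs = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # ignore malformed rows
        pid, ppid, rss, args = fields
        procs[int(pid)] = (int(ppid), int(rss), args)
    return procs


def tree_rss_kb(procs, root_pid):
    """Sum resident memory of root_pid and all of its descendants."""
    children = defaultdict(list)
    for pid, (ppid, _rss, _args) in procs.items():
        children[ppid].append(pid)

    total, stack, seen = 0, [root_pid], set()
    while stack:
        pid = stack.pop()
        if pid in seen or pid not in procs:
            continue
        seen.add(pid)
        total += procs[pid][1]
        stack.extend(children[pid])
    return total


def peaks_above(samples, limit_kb):
    """Return the (timestamp, rss_kb) samples that exceed limit_kb."""
    return [(t, kb) for t, kb in samples if kb > limit_kb]
```

Calling `tree_rss_kb` on each snapshot yields a timeline for the job's process tree, and `peaks_above` picks out the timestamps worth correlating with logs or heap dumps. Summed RSS overcounts shared pages, so the totals are an upper bound.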

> Nodes Failing With "Backing Channel '<slave_name>' is disconnected.
> -------------------------------------------------------------------
>                 Key: INFRA-16380
>                 URL:
>             Project: Infrastructure
>          Issue Type: Bug
>          Components: Jenkins
>            Reporter: Jason Kuster
>            Priority: Major
> We've seen a couple of Jenkins builds dying with the cited error. They generally look like the following:
> {code}
> FATAL: command execution failed
> Backing channel 'beam2' is disconnected.
> 	at hudson.remoting.RemoteInvocationHandler.channelOrFail(
> 	at hudson.remoting.RemoteInvocationHandler.invoke(
> 	at com.sun.proxy.$Proxy131.isAlive(Unknown Source)
> 	at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(
> 	at hudson.Launcher$RemoteLauncher$ProcImpl.join(
> 	at hudson.Launcher$ProcStarter.join(
> 	at hudson.plugins.gradle.Gradle.performTask(
> 	at hudson.plugins.gradle.Gradle.perform(
> 	at hudson.tasks.BuildStepMonitor$1.perform(
> 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(
> 	at hudson.model.Build$
> 	at hudson.model.Build$BuildExecution.doRun(
> 	at hudson.model.AbstractBuild$
> 	at hudson.model.Run.execute(
> 	at
> 	at hudson.model.ResourceController.execute(
> 	at
> Caused by: Unexpected termination of the channel
> 	at hudson.remoting.SynchronousCommandTransport$
> Caused by:
> 	at$PeekInputStream.readFully(
> 	at$BlockDataInputStream.readShort(
> 	at
> 	at<init>(
> 	at hudson.remoting.ObjectInputStreamEx.<init>(
> 	at
> 	at hudson.remoting.SynchronousCommandTransport$
> Build step 'Invoke Gradle script' changed build result to FAILURE
> Build step 'Invoke Gradle script' marked build as failure
> ERROR: Step ‘Publish JUnit test result report’ failed: no workspace for beam_PreCommit_Java_GradleBuild
> ERROR: beam2 is offline; cannot locate JDK 1.8 (latest)
> ERROR: beam2 is offline; cannot locate JDK 1.8 (latest)
> ERROR: beam2 is offline; cannot locate JDK 1.8 (latest)
> ERROR: beam2 is offline; cannot locate JDK 1.8 (latest)
> Setting status of 1073baaaa633dd34ed552812e65108944eb92ac6 to FAILURE with url and message: 'FAILURE
>  '
> Using context: Jenkins: ./gradlew :javaPreCommit
> ERROR: beam2 is offline; cannot locate JDK 1.8 (latest)
> Finished: FAILURE
> {code}

This message was sent by Atlassian JIRA
