infra-issues mailing list archives

From "Allen Wittenauer (JIRA)" <>
Subject [jira] [Commented] (INFRA-15373) Various Jenkins bits are in trouble
Date Sat, 28 Oct 2017 13:00:03 GMT


Allen Wittenauer commented on INFRA-15373:

This is bad news.

We just had a job lose access to Jenkins during the run.  This job had the Apache Yetus resource
controls (memory and process limits) wrapped around it, so there's pretty much no way we caused the agent to disconnect.
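For context, the kind of per-process resource controls described above can be approximated with plain `ulimit` in a child shell. This is only an illustrative sketch, not actual Yetus code; the `run_limited` helper and both limit values are made up for the example:

```shell
# Illustrative only: wrap a command in per-process resource limits the
# way a test harness might. run_limited and the limit values below are
# hypothetical, not the real Yetus implementation.
run_limited() {
  mem_kb=$1; nproc=$2; shift 2
  # Apply the limits in a child shell so the caller is unaffected,
  # then exec the real command under those limits.
  sh -c "ulimit -v $mem_kb 2>/dev/null; ulimit -u $nproc 2>/dev/null; exec \"\$0\" \"\$@\"" "$@"
}

run_limited 4194304 2048 echo "build step runs here"
```

A command killed by these limits fails inside the wrapper without taking the agent down, which is why a clean resource-limit failure later in the run argues against the job itself causing the disconnect.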

In fact, the job continued to run (since Jenkins doesn't kill Docker containers), and it later
reported that it hit either the process or the memory limit that Yetus had put in place.

I'm inclined to believe there are some other bad actors out there, that someone else kicked Jenkins,
or that it's a bug in the Jenkins/OS setup.  There's no build history on that node (H4) now (at
least, none that shows up for me in the UI), so I have no idea what else was running while
this job was.

I don't think I'm personally going to be able to do much more to troubleshoot these issues
without better access to process and memory information.  I'll obviously push for the Yetus
community to adopt 0.7.0 once it comes out, but other than that, I'm not sure what else to do.
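If shell access does materialize, the kernel log is the first place to look for OOM-killer activity. The log line below is synthetic (on a real node these come from `dmesg` or /var/log/kern.log), but the extraction pattern matches the usual "Killed process" format:

```shell
# Synthetic example of the kernel's OOM-killer log line; on a real node
# you'd pull these from `dmesg` or /var/log/kern.log. The sed patterns
# show how to recover the victim PID and process name for correlation
# with the build that was running at the time.
line='Out of memory: Killed process 12345 (java) total-vm:9000000kB, anon-rss:8500000kB'
pid=$(printf '%s\n' "$line" | sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1/p')
name=$(printf '%s\n' "$line" | sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\2/p')
echo "$pid $name"
```

Matching the kill timestamp against the Jenkins build log would confirm or rule out the 50% OOM hypothesis mentioned below.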

> Various Jenkins bits are in trouble
> -----------------------------------
>                 Key: INFRA-15373
>                 URL:
>             Project: Infrastructure
>          Issue Type: Bug
>          Components: Jenkins
>            Reporter: Allen Wittenauer
>            Assignee: Daniel Takamori
>            Priority: Blocker
> It looks like Jenkins is no longer scheduling jobs.  I attempted to restart the Jenkins
> agent via the UI on H4 (see below), and that's when things appear to have stopped getting
> scheduled.
> Also, nodes H1, H2, H10, and H12 need to be kicked.
> Additionally, I think I've discovered that the Hadoop HDFS unit tests for branch-2 are
> causing havoc on build nodes (tracking in HDFS-12711).  I'm at 50% confidence that the problem
> is OOM-killer related.  In the majority of cases, the nodes become entirely unavailable to Jenkins.
> Today, H4 reported back data from inside the container, which meant that it wasn't a kernel
> panic.  So I restarted the Jenkins agent, but it still never fully came back from what
> I can tell.
> In any case, I'm still trying to reproduce the problem locally, but it's tough going.
> I'm going to hard-set which nodes certain tests run on to try to limit the damage, though.
> Additionally, I've been working on YETUS-561 in case it is OOM related.  From experiments,
> that seems to work when OOM actually is the problem: unit tests and the like are sufficiently
> contained.
> Anyway, sorry for the issues and thanks for the help.

This message was sent by Atlassian JIRA
