spark-dev mailing list archives

From shane knapp <skn...@berkeley.edu>
Subject Re: BUILD SYSTEM: builds are OOMing the jenkins workers, investigating. also need to reboot amp-jenkins-worker-06
Date Tue, 20 Oct 2015 22:46:30 GMT
amp-jenkins-worker-06 is back up.

my next bets are on -07 and -08...  :\

https://amplab.cs.berkeley.edu/jenkins/computer/

On Tue, Oct 20, 2015 at 3:39 PM, shane knapp <sknapp@berkeley.edu> wrote:
> here's the related stack trace from dmesg...  UID 500 is jenkins.
>
> Out of memory: Kill process 142764 (java) score 40 or sacrifice child
> Killed process 142764, UID 500, (java) total-vm:24685036kB,
> anon-rss:5730824kB, file-rss:64kB
> Uhhuh. NMI received for unknown reason 21 on CPU 0.
> Do you have a strange power saving mode enabled?
> Dazed and confused, but trying to continue
> java: page allocation failure. order:2, mode:0xd0
> Pid: 142764, comm: java Not tainted 2.6.32-573.3.1.el6.x86_64 #1
> Call Trace:
>  [<ffffffff8113770c>] ? __alloc_pages_nodemask+0x7dc/0x950
>  [<ffffffff81074fa8>] ? copy_process+0x168/0x1530
>  [<ffffffff810764c6>] ? do_fork+0x96/0x4c0
>  [<ffffffff810b828b>] ? sys_futex+0x7b/0x170
>  [<ffffffff81009598>] ? sys_clone+0x28/0x30
>  [<ffffffff8100b3f3>] ? stub_clone+0x13/0x20
>  [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
>
> On Tue, Oct 20, 2015 at 3:35 PM, shane knapp <sknapp@berkeley.edu> wrote:
>> -06 just kinda came back...
>>
>> [root@amp-jenkins-worker-06 ~]# uptime
>>  15:29:07 up 26 days,  7:34,  2 users,  load average: 1137.91, 1485.69, 1635.89
>>
>> the builds that, from looking at the process table, seem to be at
>> fault are the Spark-Master-Maven-pre-YARN matrix builds, and possibly
>> a Spark-Master-SBT matrix build.  look at the build history here:
>> https://amplab.cs.berkeley.edu/jenkins/computer/amp-jenkins-worker-06/builds
>>
>> the load is dropping significantly and quickly, but swap is borked and
>> i'm still going to reboot.
>>
>> On Tue, Oct 20, 2015 at 3:24 PM, shane knapp <sknapp@berkeley.edu> wrote:
>>> since this past saturday (oct 17), we've been getting alerts on the
>>> jenkins workers that various processes were dying (specifically ssh).
>>>
>>> since then, we've had half of our workers OOM due to java processes
>>> and have now had to reboot two of them (-05 and -06).
>>>
>>> if we look at the current machine that's wedged (amp-jenkins-worker-06), we see:
>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/
>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=2.0.0-mr1-cdh4.1.2,label=spark-test/4508/
>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4508/
>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3868/
>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Compile-Master-Maven-with-YARN/4510/
>>>
>>> have there been any changes to any of these builds that might have
>>> caused this?  anyone have any ideas?
>>>
>>> sadly, even though i saw that -06 was about to OOM and got a shell
>>> opened before SSH died, my command prompt is completely unresponsive.
>>> :(
>>>
>>> shane
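
for anyone triaging the same thing on another worker: the "Killed process" line in the dmesg output quoted above has a regular enough shape to pull apart with a script. below is a minimal sketch, assuming the line format is exactly what this 2.6.32 el6 kernel printed (other kernel versions may word it differently). the names KILLED_RE, parse_oom_kills and rss_by_uid are made up for illustration and are not part of any existing tool.

```python
import re

# Matches the OOM-killer "Killed process" line as it appears in the dmesg
# output quoted in this thread. \s+ between fields tolerates the line being
# wrapped by a mail client. The format is an assumption based on that one sample.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+), UID (?P<uid>\d+), \((?P<comm>[^)]+)\)\s+"
    r"total-vm:(?P<total_vm_kb>\d+)kB,\s+"
    r"anon-rss:(?P<anon_rss_kb>\d+)kB,\s+"
    r"file-rss:(?P<file_rss_kb>\d+)kB"
)


def parse_oom_kills(dmesg_text):
    """Return one dict per OOM kill found in the given dmesg text."""
    kills = []
    for match in KILLED_RE.finditer(dmesg_text):
        record = {
            key: (value if key == "comm" else int(value))
            for key, value in match.groupdict().items()
        }
        kills.append(record)
    return kills


def rss_by_uid(kills):
    """Sum anonymous RSS (kB) of killed processes, grouped by UID."""
    totals = {}
    for kill in kills:
        totals[kill["uid"]] = totals.get(kill["uid"], 0) + kill["anon_rss_kb"]
    return totals


if __name__ == "__main__":
    # Sample copied from the quoted dmesg output; on a real worker you would
    # feed in the text of `dmesg` instead.
    sample = (
        "Killed process 142764, UID 500, (java) total-vm:24685036kB,\n"
        "anon-rss:5730824kB, file-rss:64kB\n"
    )
    for kill in parse_oom_kills(sample):
        print(kill["pid"], kill["comm"], kill["uid"], kill["anon_rss_kb"])
```

grouping by UID is only useful here because UID 500 is jenkins on these workers; mapping a killed pid back to a specific build would still need the process table or the jenkins build history.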


