spark-dev mailing list archives

From shane knapp <skn...@berkeley.edu>
Subject Re: [build system] jenkins got itself wedged...
Date Sat, 20 May 2017 00:43:14 GMT
last update of the week:

things are looking great...  we're GCing happily and staying well
within our memory limits.

i'm going to do one more restart after the two pull request builds
finish to re-enable backups, and call it a weekend.  :)

shane

On Fri, May 19, 2017 at 8:29 AM, shane knapp <sknapp@berkeley.edu> wrote:
> this is hopefully my final email on the subject...   :)
>
> things seem to have settled down after my GC tuning, and system
> load/cpu usage/memory has been nice and flat all night.  i'll continue
> to keep an eye on things but it looks like we've weathered the worst
> part of the storm.
>
> On Thu, May 18, 2017 at 6:40 PM, shane knapp <sknapp@berkeley.edu> wrote:
>> after needing another restart this afternoon, i did some homework and
>> aggressively twiddled some GC settings[1].  since then, things have
>> definitely smoothed out w/regard to memory and cpu usage spikes.
>>
>> i've attached a screenshot of slightly happier looking graphs.
>>
>> still keeping an eye on things, and hoping that i can go back to being
>> a lurker...  ;)
>>
>> shane
>>
>> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
>>
>> On Thu, May 18, 2017 at 11:20 AM, shane knapp <sknapp@berkeley.edu> wrote:
>>> ok, more updates:
>>>
>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>> and spark-*-test-* jobs were set to the identical cron time trigger,
>>> so josh rosen and i updated them to run at H/5 (instead of */5).  load
>>> balancing ftw.
>>>
>>> 2) the jenkins master is now running on java8, which has moar bettar
>>> GC management under the hood.
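
A rough sketch of why H/5 spreads load where */5 does not: */5 fires every job at minutes 0, 5, 10, ..., while H/5 keeps the five-minute interval but starts each job at a stable offset derived from its name. The md5 hash and the job name "job-a" below are stand-ins for illustration only; this is not the hash jenkins uses internally.

```python
import hashlib


def start_minutes(job_name, interval=5, hashed=True):
    """Minutes of the hour at which an every-`interval`-minutes trigger fires."""
    # "*/5" style: every job shares offset 0, so all jobs start together.
    offset = 0
    if hashed:
        # "H/5" style: a stable per-job offset in [0, interval).
        # md5 is an illustrative stand-in, not the real jenkins hash.
        digest = hashlib.md5(job_name.encode("utf-8")).hexdigest()
        offset = int(digest, 16) % interval
    return list(range(offset, 60, interval))


# hypothetical job name, used only to show the call
print(start_minutes("job-a", hashed=False))
print(start_minutes("job-a"))
```

With the hashed form, jobs whose names hash to different offsets no longer all start in the same minute, which is the load-balancing effect described above.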
>>>
>>> i'll be keeping an eye on this today, and if we start seeing GC
>>> overhead failures, i'll start doing more GC performance tuning.
>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>> following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>
>>> shane
>>>
>>> On Thu, May 18, 2017 at 8:39 AM, shane knapp <sknapp@berkeley.edu> wrote:
>>>> yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
>>>> getting some error messages in the logs...   looks like jenkins is
>>>> thrashing on GC.
>>>>
>>>> now that i know what's up, i should be able to get this sorted today.
>>>>
>>>> On Thu, May 18, 2017 at 12:39 AM, Sean Owen <sowen@cloudera.com> wrote:
>>>>> I'm not sure if it's related, but I still can't get Jenkins to test PRs. For
>>>>> example, triggering it through the spark-prs.appspot.com UI gives me...
>>>>>
>>>>> https://spark-prs.appspot.com/trigger-jenkins/18012
>>>>>
>>>>> Internal Server Error
>>>>>
>>>>> That might be from the appspot app though?
>>>>>
>>>>> But posting "Jenkins test this please" on PRs doesn't seem to work, and I
>>>>> can't reach Jenkins:
>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>>>>
>>>>> On Thu, May 18, 2017 at 12:44 AM shane knapp <sknapp@berkeley.edu> wrote:
>>>>>>
>>>>>> after another couple of restarts due to high load and system
>>>>>> unresponsiveness, i finally found what is the most likely culprit:
>>>>>>
>>>>>> a typo in the jenkins config where the java heap size was configured.
>>>>>> instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
>>>>>> random and non-deterministic system hangs we've had over the past
>>>>>> couple of years.
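
To make the typo concrete: -Xmx16g sets the maximum heap, while -Dmx16G only defines a system property named mx16G, so the JVM falls back to its default heap size. The check below is a minimal sketch of how a config audit could catch this; the function name and the argument lists are hypothetical, and it assumes a later -Xmx flag overrides an earlier one.

```python
import re

# -Xmx followed by a size such as 16g, 512m, or a plain byte count
_XMX = re.compile(r"^-Xmx(\d+)([kKmMgG]?)$")
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def max_heap_bytes(jvm_args):
    """Return the heap limit set via -Xmx, or None if no argument sets one."""
    limit = None
    for arg in jvm_args:
        match = _XMX.match(arg)
        if match:
            # assumption: the last -Xmx on the command line wins
            limit = int(match.group(1)) * _UNITS[match.group(2).lower()]
    return limit


# hypothetical argument lists, for illustration only
print(max_heap_bytes(["-Dmx16G"]))   # None: only a system property is defined
print(max_heap_bytes(["-Xmx16g"]))   # 17179869184
```

A None result for a master that is supposed to have a 16g heap would have flagged the misconfiguration.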
>>>>>>
>>>>>> anyways, it's been corrected and the master seems to be humming along,
>>>>>> for real this time, w/o issue.  i'll continue to keep an eye on this
>>>>>> for the rest of the week, but things are looking MUCH better now.
>>>>>>
>>>>>> sorry again for the interruptions in service.
>>>>>>
>>>>>> shane
>>>>>>
>>>>>> On Wed, May 17, 2017 at 9:59 AM, shane knapp <sknapp@berkeley.edu> wrote:
>>>>>> > ok, we're back up, system load looks cromulent and we're happily
>>>>>> > building (again).
>>>>>> >
>>>>>> > shane
>>>>>> >
>>>>>> > On Wed, May 17, 2017 at 9:50 AM, shane knapp <sknapp@berkeley.edu>
>>>>>> > wrote:
>>>>>> >> i'm going to need to perform a quick reboot on the jenkins master.  it
>>>>>> >> looks like it's hung again.
>>>>>> >>
>>>>>> >> sorry about this!
>>>>>> >>
>>>>>> >> shane
>>>>>> >>
>>>>>> >> On Tue, May 16, 2017 at 12:55 PM, shane knapp <sknapp@berkeley.edu>
>>>>>> >> wrote:
>>>>>> >>> ...but just now i started getting alerts on system load, which was
>>>>>> >>> rather high.  i had to kick jenkins again, and will keep an eye on the
>>>>>> >>> master and possibly need to reboot.
>>>>>> >>>
>>>>>> >>> sorry about the interruption of service...
>>>>>> >>>
>>>>>> >>> shane
>>>>>> >>>
>>>>>> >>> On Tue, May 16, 2017 at 8:18 AM, shane knapp <sknapp@berkeley.edu>
>>>>>> >>> wrote:
>>>>>> >>>> ...so i kicked it and it's now back up and happily building.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>
>>>>>


