spark-dev mailing list archives

From Marcelo Vanzin <van...@cloudera.com>
Subject Re: Tests failing with GC limit exceeded
Date Fri, 06 Jan 2017 00:38:49 GMT
Seems like the OOM is coming from tests, which most probably means
it's not an infrastructure issue. Maybe tests just need more memory
these days and we need to update maven / sbt scripts.
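
For a quick sanity check on the failure rate quoted below (16 GC failures
out of 408 builds), here is a minimal Python sketch. It assumes the console
logs are already available as strings. The function name and the sample
data are hypothetical stand-ins, not part of the Jenkins setup. The marker
string is the standard JVM error text for this failure.

```python
# Standard JVM message for this class of OutOfMemoryError.
GC_MARKER = "GC overhead limit exceeded"


def gc_failure_rate(console_logs):
    """Return (failures, total, percent) for an iterable of console log strings.

    Hypothetical helper: how the logs are fetched is left out on purpose.
    """
    logs = list(console_logs)
    failures = sum(1 for text in logs if GC_MARKER in text)
    total = len(logs)
    percent = 100.0 * failures / total if total else 0.0
    return failures, total, percent


# Synthetic stand-ins matching the counts quoted in the thread:
# 408 builds, 16 of which hit the GC overhead limit.
sample_logs = ["java.lang.OutOfMemoryError: " + GC_MARKER] * 16 + ["BUILD SUCCESS"] * 392

failures, total, percent = gc_failure_rate(sample_logs)
print(f"{failures}/{total} builds = {percent:.1f}%")  # 16/408 builds = 3.9%
```

That is consistent with the "4% of builds" figure in the preliminary findings.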

On Thu, Jan 5, 2017 at 1:19 PM, shane knapp <sknapp@berkeley.edu> wrote:
> as of first thing this morning, here's the list of recent GC overhead
> build failures:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70891/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70874/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70842/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70927/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70551/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70835/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70841/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70869/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70598/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70898/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70629/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70644/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70686/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70620/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70871/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70873/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70622/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70837/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70626/console
>
> i haven't really found anything that jumps out at me except perhaps
> auditing/upping the java memory limits across the build.  this seems
> to be a massive shot in the dark, and time consuming, so let's just
> call this a "method of last resort".
>
> looking more closely at the systems themselves, it looked to me that
> there was enough java "garbage" that had accumulated over the last 5
> months (since the last reboot) that system reboots would be a good
> first step.
>
> https://www.youtube.com/watch?v=nn2FB1P_Mn8
>
> over the course of this morning i've been sneaking in worker reboots
> during quiet times...  the ganglia memory graphs look a lot better
> (free memory up, cached memory down!), and i'll keep an eye on things
> over the course of the next few days to see if the build failure
> frequency is affected.
>
> also, i might be scheduling quarterly system reboots if this indeed
> fixes the problem.
>
> shane
>
> On Wed, Jan 4, 2017 at 1:22 PM, shane knapp <sknapp@berkeley.edu> wrote:
>> preliminary findings:  seems to be transient, and affecting 4% of
>> builds from late december until now (which is as far back as we keep
>> build records for the PRB builds).
>>
>>  408 builds
>>   16 builds.gc   <--- failures
>>
>> it's also happening across all workers at about the same rate.
>>
>> and best of all, there seems to be no pattern to which tests are
>> failing (different each time).  i'll look a little deeper and decide
>> what to do next.
>>
>> On Tue, Jan 3, 2017 at 6:49 PM, shane knapp <sknapp@berkeley.edu> wrote:
>>> nope, no changes to jenkins in the past few months.  ganglia graphs
>>> show higher, but not worrying, memory usage on the workers when the
>>> jobs failed...
>>>
>>> i'll take a closer look later tonite/first thing tomorrow morning.
>>>
>>> shane
>>>
>>> On Tue, Jan 3, 2017 at 4:35 PM, Kay Ousterhout <keo@eecs.berkeley.edu> wrote:
>>>> I've noticed a bunch of the recent builds failing because of GC limits, for
>>>> seemingly unrelated changes (e.g. 70818, 70840, 70842).  Shane, have there
>>>> been any recent changes in the build configuration that might be causing
>>>> this?  Does anyone else have any ideas about what's going on here?
>>>>
>>>> -Kay
>>>>
>>>>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>



-- 
Marcelo


