hbase-dev mailing list archives

From Nicolas Liochon <nkey...@gmail.com>
Subject Re: All builds are recently failing with timeout or fork errors, let's change settings
Date Mon, 26 Jan 2015 16:45:58 GMT
I see in https://builds.apache.org/computer/ubuntu-2/load-statistics (used
for the 0.98 build mentioned by Andrew above) that we have a configuration
with 2 executors.
It means that Jenkins tries to run 2 builds in parallel, and each of these
builds will trigger its own set of surefire forks.
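
To make the arithmetic concrete, here is a minimal sketch in Python. The
fork count is a made-up placeholder (I have not checked what each job
actually passes to surefire), and it assumes every build runs all of its
forks at the same moment, which is the worst case:

```python
def worst_case_forked_jvms(executors: int, forks_per_build: int) -> int:
    """Upper bound on surefire forked JVMs running at once on one slave.

    Each executor can run one build, and each build starts its own forks.
    """
    if executors < 1 or forks_per_build < 1:
        raise ValueError("executors and forks_per_build must be >= 1")
    return executors * forks_per_build


# Hypothetical fork count per build; not taken from the real job config.
ASSUMED_FORKS = 5

for name, executors in [("H0", 1), ("ubuntu-2", 2)]:
    for forks in (ASSUMED_FORKS, 1):
        total = worst_case_forked_jvms(executors, forks)
        print(f"{name}: {executors} executor(s) x {forks} fork(s) = {total} JVMs")
```

Under this simple model, even with the fork counts forced to 1, a machine
with 2 executors can still end up with two forked JVMs side by side if both
executors pick up a build.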

iirc, in the past:
 - we were not building on these machines, we were using only the hadoop
pool of machines
 - these machines were configured with 1 executor

From what I see, there are two sets of machines:
 - H*, for hadoop projects: H0 (for example) is configured with a single
executor.
 - ubuntu*, for everybody: ubuntu-2 (for example) is configured with 2
executors.

0.98 and PreCommit-HBASE-Build are configured with: (ubuntu||Hadoop) &&
!jenkins-cloud-4GB && !H11

So it depends: lucky = H*. Unlucky = ubuntu*
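
Spelled out as a small Python sketch: the function is my hand translation
of the label expression above, and the label sets are assumptions (I am
guessing that H* slaves carry the Hadoop label and ubuntu* slaves the
ubuntu label; a Jenkins node's own name also counts as a label):

```python
def job_can_run_on(labels: set) -> bool:
    """Hand translation of: (ubuntu||Hadoop) && !jenkins-cloud-4GB && !H11"""
    return (
        ("ubuntu" in labels or "Hadoop" in labels)
        and "jenkins-cloud-4GB" not in labels
        and "H11" not in labels
    )


# Assumed label sets, for illustration only.
ASSUMED_SLAVES = {
    "H0": {"H0", "Hadoop"},
    "H11": {"H11", "Hadoop"},
    "ubuntu-2": {"ubuntu-2", "ubuntu"},
}

for name, labels in ASSUMED_SLAVES.items():
    print(name, "eligible" if job_can_run_on(labels) else "excluded")
```

If those label guesses are right, nothing in the expression prefers the
single-executor H* machines over the ubuntu* ones; both are eligible.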

I don't know who changed this, nor why, but maybe we should not go to
ubuntu* machines. Or, if it's possible, we should have a different config
for these machines.



On Mon, Jan 19, 2015 at 7:11 PM, Andrew Purtell <andrew.purtell@gmail.com>
wrote:

> The 0.98 build is still showing this problem (latest as of now at
> https://builds.apache.org/job/hbase-0.98/803), so I went ahead and made the
> proposed change, but only to the 0.98 builds. I'll let you know if it
> provides any improvement.
>
>
> On Sun, Jan 18, 2015 at 10:00 AM, Andrew Purtell <andrew.purtell@gmail.com>
> wrote:
>
> > Forked VMs are being killed in the 0.98 builds. That suggests
> > infrastructure issues.
> >
> > Having only one test execute in a forked runner does mean the finding of a
> > zombie and thread dumps or other state from the runner will identify and
> > characterize a sick test with no unrelated state mixed in.
> >
> >
> > > On Jan 17, 2015, at 7:43 PM, Stack <stack@duboce.net> wrote:
> > >
> > > Agree, try anything to get our blues back. We add back the //ism after
> > > all settles.
> > >
> > > Do you think something has changed in INFRA Andy? Is it more contended?
> > > Or, more likely, is it that we've been committing stuff that has
> > > destabilized builds? We had a good streak of blue there for a while. It
> > > just took some work fixing breakage and watching jenkins to make sure
> > > breakage didn't sneak in, but we've lapsed for sure.
> > >
> > > St.Ack
> > >
> > >> On Sat, Jan 17, 2015 at 9:19 AM, Dima Spivak <dspivak@cloudera.com>
> > >> wrote:
> > >>
> > >> Not running tests in parallel will definitely cut down on Surefire
> > >> flakiness (and in contention that sometimes leads to false failures in
> > >> resource-hungry tests), but it will probably also balloon test run
> > >> times to about two hours. Probably worth it in the short term, but we
> > >> eventually need to do something about some of these heavy tests.
> > >>
> > >> -Dima
> > >>
> > >> On Friday, January 16, 2015, Andrew Purtell <andrew.purtell@gmail.com>
> > >> wrote:
> > >>
> > >>> You might have missed the larger issue Ted.
> > >>>
> > >>>
> > >>>> On Jan 16, 2015, at 4:48 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > >>>>
> > >>>> With HBASE-12874, we should get a green build for branch-1.0
> > >>>>
> > >>>> FYI
> > >>>>
> > >>>> On Fri, Jan 16, 2015 at 12:20 PM, Andrew Purtell <apurtell@apache.org>
> > >>>> wrote:
> > >>>>
> > >>>>> See BUILDS-49 tracking issues specifically with 0.98 jobs, but I just
> > >>>>> noticed trunk, branch-1, and branch-1.0 all failed after I checked in
> > >>>>> a shell doc fix due to a timeout or fork failure.
> > >>>>>
> > >>>>> I propose we update all Jenkins jobs to not run tests in parallel,
> > >>>>> i.e. add
> > >>>>> "-Dsurefire.firstPartForkCount=1 -Dsurefire.secondPartForkCount=1"
> > >>>>>
> > >>>>> --
> > >>>>> Best regards,
> > >>>>>
> > >>>>>  - Andy
> > >>>>>
> > >>>>> Problems worthy of attack prove their worth by hitting back. - Piet
> > >>>>> Hein (via Tom White)
> > >>
> >
>
