hbase-dev mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: All builds are recently failing with timeout or fork errors, let's change settings
Date Mon, 26 Jan 2015 17:57:04 GMT
I will change the number of executors for the 0.98 builds to 1. Thanks for
the tip, N!


On Mon, Jan 26, 2015 at 8:45 AM, Nicolas Liochon <nkeywal@gmail.com> wrote:

> I see in https://builds.apache.org/computer/ubuntu-2/load-statistics (used
> for the 0.98 build mentioned by Andrew above) that we have a configuration
> with 2 executors.
> This means that Jenkins tries to run 2 builds in parallel, and each of
> these builds will trigger its own set of surefire forks.
>
> iirc, in the past:
>  - we were not building on these machines, we were using only the hadoop
> pool of machines
>  - these machines were configured with 1 executor
>
> From what I see, there are two sets of machines
>  - H*, for hadoop projects. H0 (for example) is configured with a single
> executor.
>  - ubuntu*, for everybody: ubuntu2 (for example) is configured with 2
> executors.
>
> 0.98 and PreCommit-HBASE-Build are configured with: (ubuntu||Hadoop) &&
> !jenkins-cloud-4GB && !H11
>
> So it depends: lucky = H*. Unlucky = ubuntu*
>
> I don't know who changed this, nor why, but maybe we should not go to
> ubuntu* machines. Or, if it's possible, we should have a different config
> for these machines.
>
>
>
> On Mon, Jan 19, 2015 at 7:11 PM, Andrew Purtell <andrew.purtell@gmail.com>
> wrote:
>
> > The 0.98 build is still showing this problem (latest as of now at
> > https://builds.apache.org/job/hbase-0.98/803), so I went ahead and made
> > the proposed change, but only to the 0.98 builds. I'll let you know if it
> > provides any improvement.
> >
> >
> > On Sun, Jan 18, 2015 at 10:00 AM, Andrew Purtell <andrew.purtell@gmail.com>
> > wrote:
> >
> > > Forked VMs are being killed in the 0.98 builds. That suggests
> > > infrastructure issues.
> > >
> > > Having only one test execute in a forked runner does mean the finding
> > > of a zombie and thread dumps or other state from the runner will
> > > identify and characterize a sick test with no unrelated state mixed in.
> > >
> > >
> > > > On Jan 17, 2015, at 7:43 PM, Stack <stack@duboce.net> wrote:
> > > >
> > > > > Agree, try anything to get our blues back.  We add back the //ism
> > > > > after all settles.
> > > > >
> > > > > Do you think something has changed in INFRA Andy? Is it more
> > > > > contended? Or, more likely, is it that we've been committing stuff
> > > > > that has destabilized builds? We had a good streak of blue there for
> > > > > a while. It just took some work fixing breakage and watching jenkins
> > > > > to make sure breakage didn't sneak in, but we've lapsed for sure.
> > > >
> > > > St.Ack
> > > >
> > > > >> On Sat, Jan 17, 2015 at 9:19 AM, Dima Spivak <dspivak@cloudera.com>
> > > > >> wrote:
> > > >>
> > > > >> Not running tests in parallel will definitely cut down on Surefire
> > > > >> flakiness (and in contention that sometimes leads to false failures
> > > > >> in resource-hungry tests), but it will probably also balloon test
> > > > >> run times to about two hours. Probably worth it in the short term,
> > > > >> but we eventually need to do something about some of these heavy
> > > > >> tests.
> > > >>
> > > >> -Dima
> > > >>
> > > > >> On Friday, January 16, 2015, Andrew Purtell <andrew.purtell@gmail.com>
> > > > >> wrote:
> > > >>
> > > >>> You might have missed the larger issue Ted.
> > > >>>
> > > >>>
> > > > >>>> On Jan 16, 2015, at 4:48 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > >>>>
> > > >>>> With HBASE-12874, we should get a green build for branch-1.0
> > > >>>>
> > > >>>> FYI
> > > >>>>
> > > > >>>> On Fri, Jan 16, 2015 at 12:20 PM, Andrew Purtell <apurtell@apache.org>
> > > > >>>> wrote:
> > > >>>>
> > > > >>>>> See BUILDS-49 tracking issues specifically with 0.98 jobs, but I
> > > > >>>>> just noticed trunk, branch-1, and branch-1.0 all failed after I
> > > > >>>>> checked in a shell doc fix due to a timeout or fork failure.
> > > >>>>>
> > > > >>>>> I propose we update all Jenkins jobs to not run tests in
> > > > >>>>> parallel, i.e. add
> > > > >>>>> "-Dsurefire.firstPartForkCount=1 -Dsurefire.secondPartForkCount=1"
> > > >>>>>
> > > >>>>> --
> > > >>>>> Best regards,
> > > >>>>>
> > > >>>>>  - Andy
> > > >>>>>
> > > > >>>>> Problems worthy of attack prove their worth by hitting back. -
> > > > >>>>> Piet Hein (via Tom White)
> > > >>
> > >
> >
>
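
For anyone following along, the arithmetic behind the thread above can be
sketched roughly as below. This is only an illustration, not the actual
Jenkins or HBase build configuration: the function names are made up, the
fork count of 5 is a hypothetical placeholder (the thread does not state the
default), and it assumes each build keeps at most that many forked JVMs alive
at once. Only the executor counts (2 on ubuntu*, 1 on H*) and the two
-Dsurefire.* properties come from the thread.

```python
from typing import List


def max_concurrent_forks(executors: int, forks_per_build: int) -> int:
    """Upper bound on forked test JVMs alive at once on one Jenkins node.

    Assumes every executor runs one build and every build keeps at most
    `forks_per_build` surefire forks alive at the same time.
    """
    if executors < 1 or forks_per_build < 1:
        raise ValueError("executors and forks_per_build must be >= 1")
    return executors * forks_per_build


def surefire_args(first_part: int = 1, second_part: int = 1) -> List[str]:
    """Build the Maven properties proposed in the thread (defaults: no parallelism)."""
    return [
        f"-Dsurefire.firstPartForkCount={first_part}",
        f"-Dsurefire.secondPartForkCount={second_part}",
    ]


if __name__ == "__main__":
    hypothetical_forks = 5  # placeholder value, NOT taken from the HBase pom

    # ubuntu* node: 2 executors, each build forking in parallel
    print(max_concurrent_forks(2, hypothetical_forks))  # 10

    # H* node with the proposed settings: 1 executor, 1 fork
    print(max_concurrent_forks(1, 1))  # 1

    # Properties to append to the Maven invocation of a Jenkins job
    print(" ".join(surefire_args()))
```

The point of the sketch is just that the two knobs multiply: halving the
executors or setting the fork counts to 1 both shrink the number of JVMs
competing for the same machine.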



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
