hbase-dev mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: All builds are recently failing with timeout or fork errors, let's change settings
Date Mon, 26 Jan 2015 17:59:28 GMT
I removed "ubuntu" from the label expression for the 0.98 builds. This
dropped the available pool of nodes for these jobs from 17 to 10, but the
gain in job stability, if it pans out, will be worth the smaller pool.
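
For anyone who wants to see how a label expression narrows the node pool, here
is a minimal sketch of evaluating one against a set of node labels. It is not
how Jenkins implements it. The node names and label sets are hypothetical, the
"after" expression is only an assumption about what removing "ubuntu" from the
expression quoted further down the thread looks like, and only the `||`, `&&`,
`!` and parenthesis operators are handled.

```python
import re

# One token: an operator, a parenthesis, or a label name.
TOKEN = re.compile(r"\s*(\|\||&&|!|\(|\)|[A-Za-z0-9_.\-]+)")


def tokenize(expr):
    """Split a label expression into operator and label tokens."""
    expr = expr.strip()
    tokens, pos = [], 0
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"unexpected input at {pos}: {expr[pos:]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def matches(expr, labels):
    """Return True if a node carrying `labels` satisfies `expr`.

    Precedence, tightest first: `!`, then `&&`, then `||`.
    """
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "||":
            take()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "&&":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "!":
            take()
            return not parse_not()
        if peek() == "(":
            take()
            value = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        return take() in labels

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens in expression")
    return result


# Hypothetical nodes and labels, for illustration only.
NODES = {
    "H0": {"H0", "Hadoop"},
    "H11": {"H11", "Hadoop"},
    "ubuntu-2": {"ubuntu-2", "ubuntu"},
    "cloud-1": {"cloud-1", "ubuntu", "jenkins-cloud-4GB"},
}

BEFORE = "(ubuntu||Hadoop) && !jenkins-cloud-4GB && !H11"
AFTER = "Hadoop && !jenkins-cloud-4GB && !H11"  # assumed form of the edit


def eligible(expr):
    """Names of the hypothetical nodes that satisfy `expr`, sorted."""
    return sorted(name for name, labels in NODES.items() if matches(expr, labels))


print(eligible(BEFORE))
print(eligible(AFTER))
```

With these made-up nodes, the ubuntu-labelled machine is eligible under the
first expression and not under the second, which is the same kind of pool
shrinkage as the 17 to 10 drop reported above.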


On Mon, Jan 26, 2015 at 9:57 AM, Andrew Purtell <apurtell@apache.org> wrote:

> I will change the number of executors for the 0.98 builds to 1. Thanks for
> the tip, N!
>
>
> On Mon, Jan 26, 2015 at 8:45 AM, Nicolas Liochon <nkeywal@gmail.com>
> wrote:
>
>> I see in https://builds.apache.org/computer/ubuntu-2/load-statistics (used
>> for the 0.98 build mentioned by Andrew above) that we have a configuration
>> with 2 executors.
>> It means that Jenkins tries to run 2 builds in parallel, and each of these
>> builds will trigger its own set of surefire forks.
>>
>> iirc, in the past:
>>  - we were not building on these machines, we were using only the hadoop
>> pool of machines
>>  - these machines were configured with 1 executor
>>
>> From what I see, there are two sets of machines
>>  - H*, for hadoop projects. H0 (for example) is configured with a single
>> executor.
>>  - ubuntu*, for everybody: ubuntu2 (for example) is configured with 2
>> executors.
>>
>> 0.98 and PreCommit-HBASE-Build are configured with: (ubuntu||Hadoop) &&
>> !jenkins-cloud-4GB && !H11
>>
>> So it depends: lucky = H*. Unlucky = ubuntu*
>>
>> I don't know who changed this, nor why, but maybe we should not go to
>> ubuntu* machines. Or, if it's possible, we should have a different config
>> for these machines.
>>
>>
>>
>> On Mon, Jan 19, 2015 at 7:11 PM, Andrew Purtell <andrew.purtell@gmail.com>
>> wrote:
>>
>> > The 0.98 build is still showing this problem (latest as of now at
>> > https://builds.apache.org/job/hbase-0.98/803), so I went ahead and made
>> > the proposed change, but only to the 0.98 builds. I'll let you know if it
>> > provides any improvement.
>> >
>> >
>> > On Sun, Jan 18, 2015 at 10:00 AM, Andrew Purtell
>> > <andrew.purtell@gmail.com> wrote:
>> >
>> > > Forked VMs are being killed in the 0.98 builds. That suggests
>> > > infrastructure issues.
>> > >
>> > > Having only one test execute in a forked runner does mean that when we
>> > > find a zombie, the thread dumps or other state from the runner will
>> > > identify and characterize a sick test with no unrelated state mixed in.
>> > >
>> > >
>> > > > On Jan 17, 2015, at 7:43 PM, Stack <stack@duboce.net> wrote:
>> > > >
>> > > > Agree, try anything to get our blues back.  We add back the //ism
>> > > > after all settles.
>> > > >
>> > > > Do you think something has changed in INFRA Andy? Is it more
>> > > > contended? Or, more likely, is it that we've been committing stuff
>> > > > that has destabilized builds? We had a good streak of blue there for
>> > > > a while. It just took some work fixing breakage and watching jenkins
>> > > > to make sure breakage didn't sneak in, but we've lapsed for sure.
>> > > >
>> > > > St.Ack
>> > > >
>> > > >> On Sat, Jan 17, 2015 at 9:19 AM, Dima Spivak <dspivak@cloudera.com>
>> > > >> wrote:
>> > > >>
>> > > >> Not running tests in parallel will definitely cut down on Surefire
>> > > >> flakiness (and on contention that sometimes leads to false failures
>> > > >> in resource-hungry tests), but it will probably also balloon test
>> > > >> run times to about two hours. Probably worth it in the short term,
>> > > >> but we eventually need to do something about some of these heavy
>> > > >> tests.
>> > > >>
>> > > >> -Dima
>> > > >>
>> > > >> On Friday, January 16, 2015, Andrew Purtell
>> > > >> <andrew.purtell@gmail.com> wrote:
>> > > >>
>> > > >>> You might have missed the larger issue Ted.
>> > > >>>
>> > > >>>
>> > > >>>> On Jan 16, 2015, at 4:48 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> > > >>>>
>> > > >>>> With HBASE-12874, we should get a green build for branch-1.0
>> > > >>>>
>> > > >>>> FYI
>> > > >>>>
>> > > >>>> On Fri, Jan 16, 2015 at 12:20 PM, Andrew Purtell
>> > > >>>> <apurtell@apache.org> wrote:
>> > > >>>>
>> > > >>>>> See BUILDS-49 tracking issues specifically with 0.98 jobs, but I
>> > > >>>>> just noticed trunk, branch-1, and branch-1.0 all failed with a
>> > > >>>>> timeout or fork failure after I checked in a shell doc fix.
>> > > >>>>>
>> > > >>>>> I propose we update all Jenkins jobs to not run tests in
>> > > >>>>> parallel, i.e. add
>> > > >>>>> "-Dsurefire.firstPartForkCount=1 -Dsurefire.secondPartForkCount=1"
>> > > >>>>>
>> > > >>>>> --
>> > > >>>>> Best regards,
>> > > >>>>>
>> > > >>>>>  - Andy
>> > > >>>>>
>> > > >>>>> Problems worthy of attack prove their worth by hitting back. -
>> > > >>>>> Piet Hein (via Tom White)
>> > > >>
>> > >
>> >
>>
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
