Return-Path:
X-Original-To: apmail-hbase-dev-archive@www.apache.org
Delivered-To: apmail-hbase-dev-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D430317E40 for ; Mon, 26 Jan 2015 18:00:15 +0000 (UTC)
Received: (qmail 88100 invoked by uid 500); 26 Jan 2015 18:00:15 -0000
Delivered-To: apmail-hbase-dev-archive@hbase.apache.org
Received: (qmail 88016 invoked by uid 500); 26 Jan 2015 18:00:15 -0000
Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@hbase.apache.org
Delivered-To: mailing list dev@hbase.apache.org
Received: (qmail 88005 invoked by uid 99); 26 Jan 2015 18:00:15 -0000
Received: from mail-relay.apache.org (HELO mail-relay.apache.org) (140.211.11.15) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 26 Jan 2015 18:00:15 +0000
Received: from mail-la0-f49.google.com (mail-la0-f49.google.com [209.85.215.49]) by mail-relay.apache.org (ASF Mail Server at mail-relay.apache.org) with ESMTPSA id 988121A01E4 for ; Mon, 26 Jan 2015 18:00:14 +0000 (UTC)
Received: by mail-la0-f49.google.com with SMTP id gf13so9024194lab.8 for ; Mon, 26 Jan 2015 10:00:08 -0800 (PST)
X-Received: by 10.112.132.67 with SMTP id os3mr22370702lbb.90.1422295208276; Mon, 26 Jan 2015 10:00:08 -0800 (PST)
MIME-Version: 1.0
Received: by 10.25.11.207 with HTTP; Mon, 26 Jan 2015 09:59:28 -0800 (PST)
In-Reply-To:
References: <1FAC11D9-A08F-46DD-A8F8-68DC6A0AC6C3@gmail.com>
From: Andrew Purtell
Date: Mon, 26 Jan 2015 09:59:28 -0800
Message-ID:
Subject: Re: All builds are recently failing with timeout or fork errors, let's change settings
To: "dev@hbase.apache.org"
Content-Type: multipart/alternative; boundary=047d7b3a8192a7c080050d91e89d

--047d7b3a8192a7c080050d91e89d
Content-Type: text/plain; charset=UTF-8

I removed "ubuntu" from the label expression for the 0.98 builds.
This dropped the available pool of nodes for these jobs from 17 to 10, but the
trade-off for job stability, if it pans out, will be worth it.

On Mon, Jan 26, 2015 at 9:57 AM, Andrew Purtell wrote:

> I will change the number of executors for the 0.98 builds to 1. Thanks for
> the tip, N!
>
> On Mon, Jan 26, 2015 at 8:45 AM, Nicolas Liochon wrote:
>
>> I see in https://builds.apache.org/computer/ubuntu-2/load-statistics (used
>> for the 0.98 build mentioned by Andrew above) that we have a configuration
>> with 2 executors.
>> It means that jenkins tries to run 2 builds in parallel, and each of these
>> builds will trigger its own set of surefire forks.
>>
>> iirc, in the past:
>> - we were not building on these machines, we were using only the hadoop
>>   pool of machines
>> - these machines were configured with 1 executor
>>
>> From what I see, there are two sets of machines
>> - H*, for hadoop projects. H0 (for example) is configured with a single
>>   executor.
>> - ubuntu*, for everybody: ubuntu2 (for example) is configured with 2
>>   executors.
>>
>> 0.98 and PreCommit-HBASE-Build are configured with: (ubuntu||Hadoop) &&
>> !jenkins-cloud-4GB && !H11
>>
>> So it depends: lucky = H*. Unlucky = ubuntu*
>>
>> I don't know who changed this, nor why, but maybe we should not go to
>> ubuntu* machines. Or, if it's possible, we should have a different config
>> for these machines.
>>
>> On Mon, Jan 19, 2015 at 7:11 PM, Andrew Purtell wrote:
>>
>> > The 0.98 build is still showing this problem (latest as of now at
>> > https://builds.apache.org/job/hbase-0.98/803), so I went ahead and made
>> > the proposed change, but only to the 0.98 builds. I'll let you know if it
>> > provides any improvement.
>> >
>> > On Sun, Jan 18, 2015 at 10:00 AM, Andrew Purtell <andrew.purtell@gmail.com> wrote:
>> >
>> > > Forked VMs are being killed in the 0.98 builds. That suggests
>> > > infrastructure issues.
>> > >
>> > > Having only one test execute in a forked runner does mean the finding of a
>> > > zombie and thread dumps or other state from the runner will identify and
>> > > characterize a sick test with no unrelated state mixed in.
>> > >
>> > > > On Jan 17, 2015, at 7:43 PM, Stack wrote:
>> > > >
>> > > > Agree, try anything to get our blues back. We add back the //ism after all
>> > > > settles.
>> > > >
>> > > > Do you think something has changed in INFRA Andy? Is it more contended? Or,
>> > > > more likely, is it that we've been committing stuff that has destabilized
>> > > > builds? We had a good streak of blue there for a while. It just took some
>> > > > work fixing breakage and watching jenkins to make sure breakage didn't
>> > > > sneak in, but we've lapsed for sure.
>> > > >
>> > > > St.Ack
>> > > >
>> > > >> On Sat, Jan 17, 2015 at 9:19 AM, Dima Spivak wrote:
>> > > >>
>> > > >> Not running tests in parallel will definitely cut down on Surefire
>> > > >> flakiness (and in contention that sometimes leads to false failures in
>> > > >> resource-hungry tests), but it will probably also balloon test run times to
>> > > >> about two hours. Probably worth it in the short term, but we
>> > > >> eventually need to do something about some of these heavy tests.
>> > > >>
>> > > >> -Dima
>> > > >>
>> > > >> On Friday, January 16, 2015, Andrew Purtell <andrew.purtell@gmail.com> wrote:
>> > > >>
>> > > >>> You might have missed the larger issue Ted.
>> > > >>>
>> > > >>>> On Jan 16, 2015, at 4:48 PM, Ted Yu wrote:
>> > > >>>>
>> > > >>>> With HBASE-12874, we should get a green build for branch-1.0
>> > > >>>>
>> > > >>>> FYI
>> > > >>>>
>> > > >>>> On Fri, Jan 16, 2015 at 12:20 PM, Andrew Purtell <apurtell@apache.org> wrote:
>> > > >>>>
>> > > >>>>> See BUILDS-49 tracking issues specifically with 0.98 jobs, but I just
>> > > >>>>> noticed trunk, branch-1, and branch-1.0 all failed after I checked in a
>> > > >>>>> shell doc fix due to a timeout or fork failure.
>> > > >>>>>
>> > > >>>>> I propose we update all Jenkins jobs to not run tests in parallel, i.e. add
>> > > >>>>> "-Dsurefire.firstPartForkCount=1 -Dsurefire.secondPartForkCount=1"
>> > > >>>>>
>> > > >>>>> --
>> > > >>>>> Best regards,
>> > > >>>>>
>> > > >>>>> - Andy
>> > > >>>>>
>> > > >>>>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> > > >>>>> (via Tom White)
>
> --
> Best regards,
>
> - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)

--
Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

--047d7b3a8192a7c080050d91e89d--
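
For readers of the archive, the node-selection logic discussed in this thread can be sketched roughly as follows. This is only an illustration, not Jenkins code: the label sets and executor counts in NODES are assumed from the thread (H0 with one executor, ubuntu-2 with two), the idea that a node also carries its own name as a label is an assumption, and the "new" expression is a guess at what remains once "ubuntu" is removed, since the thread does not quote it.

```python
# Hypothetical node inventory. Names and executor counts follow the thread;
# the label sets are assumed for illustration only.
NODES = {
    "H0": {"labels": {"Hadoop", "H0"}, "executors": 1},
    "H11": {"labels": {"Hadoop", "H11"}, "executors": 1},
    "ubuntu-2": {"labels": {"ubuntu", "ubuntu-2"}, "executors": 2},
}


def matches_old(labels):
    """Old expression: (ubuntu||Hadoop) && !jenkins-cloud-4GB && !H11"""
    return (
        ("ubuntu" in labels or "Hadoop" in labels)
        and "jenkins-cloud-4GB" not in labels
        and "H11" not in labels
    )


def matches_new(labels):
    """Assumed expression after removing "ubuntu":
    Hadoop && !jenkins-cloud-4GB && !H11"""
    return (
        "Hadoop" in labels
        and "jenkins-cloud-4GB" not in labels
        and "H11" not in labels
    )


def eligible(predicate):
    """Names of the nodes whose labels satisfy the given expression."""
    return sorted(name for name, node in NODES.items() if predicate(node["labels"]))


def worst_case_parallel_builds(predicate):
    """Largest number of builds one eligible node may run at once.
    Each build starts its own set of surefire forks, so this is the
    multiplier on fork pressure for an unlucky node assignment."""
    return max(NODES[name]["executors"] for name in eligible(predicate))


if __name__ == "__main__":
    for label, pred in (("old", matches_old), ("new", matches_new)):
        print(label, eligible(pred), worst_case_parallel_builds(pred))
```

With this assumed inventory, the old expression admits both H0 and ubuntu-2, so a build can land on a node running two builds side by side; the narrowed expression leaves only the single-executor H* node.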