From dev-return-2868-archive-asf-public=cust-asf.ponee.io@mxnet.incubator.apache.org Wed May 16 17:23:26 2018
From: Marco de Abreu
Date: Wed, 16 May 2018 17:22:38 +0200
Subject: Re: Auto scaling for MXNet CI
To: dev@mxnet.incubator.apache.org
Reply-To: dev@mxnet.incubator.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="000000000000789896056c544c3a"

--000000000000789896056c544c3a
Content-Type: text/plain; charset="UTF-8"

Thanks a lot!

The following numbers are based on our experience in the test environment.

Best case: ~1:50h (unchanged) (0:01 + 0:38 + 0:39 + 0:33 + 0:03)
- conditions: no instances have to be provisioned and caches are primed

Average case: 2:10h (1:50h + 0:10 for instance startup + 0:10 for cache
loading)
- conditions: Windows instances are available (they get turned off less
frequently), Ubuntu instances have to be provisioned and caches are not
present

Worst case: 3:06h (1:50h + 0:02 + 0:50 + 0:20 + 0:02 + 0:02)
- conditions: no available instances

The bottleneck in the worst case is the Windows instances. They take about
20 minutes to start, and the unprimed MSVC cache adds about 30 minutes to
build times. To balance this out, we're applying a less aggressive
downscaling policy for Windows and using larger buffers. At the same time,
we're currently working on persistent build caches. An additional option
could be reserved instances.

We will observe the situation and then assess the required next steps. For
now, we want to make sure everything is running stably and no builds are
getting interrupted.

Best regards,
Marco

On Wed, May 16, 2018 at 3:47 AM, Thomas DELTEIL wrote:

> Great news :) thanks Marco!
>
> On Tue, May 15, 2018, 18:29 Steffen Rochel wrote:
>
> > Thanks Marco, good step forward.
> > What is the expected, typical and worst-case turnaround time (TAT) for
> > PR checks?
> >
> > Steffen
> >
> > On Tue, May 15, 2018 at 10:49 AM Marco de Abreu <
> > marco.g.abreu@googlemail.com> wrote:
> >
> > > Hello,
> > >
> > > I'd like to announce the deployment of auto scaling for our CI system
> > > (see [1] for reference, setup documentation at [2]) for today at
> > > 11:00PM PST 05/15/18. I expect no downtime since these changes are
> > > happening outside of Jenkins.
> > >
> > > This system will make our CI more flexible so that it can handle the
> > > increasing load that results from the growing number of great
> > > contributions! In future, our CI will automatically adapt to the
> > > current load and will thus support big tasks like the to-be-migrated
> > > nightly tests or increased load before a release. Additionally, we're
> > > now able to provide scalable p3.2xlarge instances and have the
> > > possibility to add new instance types without much effort.
> > >
> > > In future, you will see that new slaves are being started up as the
> > > queue grows and stopped if they go idle. Your tasks will automatically
> > > be picked up and our system makes sure every PR gets processed as fast
> > > as possible.
> > >
> > > If you encounter any issues in the next week, please don't hesitate to
> > > reach out to me. I'm looking forward to everyone's feedback!
> > >
> > > Best regards,
> > > Marco
> > >
> > >
> > > [1]:
> > > https://cwiki.apache.org/confluence/display/MXNET/Proposal%3A+Auto+Scaling
> > > [2]: https://cwiki.apache.org/confluence/display/MXNET/Setup
> > >
> >
>

--000000000000789896056c544c3a--
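
The turnaround estimates in the reply at the top of this thread are sums of h:mm durations. The sketch below only re-adds the figures quoted there, as a way to check the arithmetic. The function and variable names are hypothetical, and the message does not say what each individual term stands for, so the terms are left unlabeled.

```python
# Minimal sketch: re-add the h:mm figures quoted in the reply.
# All names here are illustrative assumptions, not part of the CI setup.

def to_minutes(hhmm: str) -> int:
    """Convert an 'h:mm' string into a number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(total_minutes: int) -> str:
    """Format a number of minutes as 'h:mm'."""
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


def total(parts: list[str]) -> str:
    """Sum a list of 'h:mm' durations and return the result as 'h:mm'."""
    return to_hhmm(sum(to_minutes(p) for p in parts))


BASE = "1:50"  # rounded best-case figure used as the base in the reply

best_terms = ["0:01", "0:38", "0:39", "0:33", "0:03"]
average_terms = [BASE, "0:10", "0:10"]  # instance startup + cache loading
worst_terms = [BASE, "0:02", "0:50", "0:20", "0:02", "0:02"]

print("best   ", total(best_terms))
print("average", total(average_terms))
print("worst  ", total(worst_terms))
```

The average and worst cases add up to the quoted 2:10 and 3:06. The best-case terms add up to 1:54, which the reply rounds to ~1:50h before using it as the base for the other two cases.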