From: Marco de Abreu
Date: Wed, 10 Jan 2018 21:37:44 +0100
Subject: Re: CI: nvml: Driver/library version mismatch
To: dev@mxnet.incubator.apache.org

Small update to give you some background:

We have been able to get the CI back to a stable state, thanks to Pedro and Kellen! The cause of this issue was a required security update related to the Spectre vulnerability (see https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-384/+bug/1741807). This update was not compatible with the installed nvidia-docker version and thus broke our CI.

I have installed all updates, validated that nvidia-docker is working again, and started a new set of mxnet-linux-gpu-slaves.

If any issues arise, please don't hesitate to drop a quick message on this thread.

-Marco

On Wed, Jan 10, 2018 at 6:45 PM, Marco de Abreu <marco.g.abreu@googlemail.com> wrote:

> Hello,
>
> recently, Nvidia released a new version of their CUDA and GPU drivers for
> Ubuntu 16.04. This update was applied automatically while the slaves
> were running, which caused the nvidia-docker-daemon to disconnect. Because
> the update requires a restart, the daemon was not able to reconnect and
> raised the error 'nvml: Driver/library version mismatch'. We have restarted
> all slaves to apply the update.
>
> In future, we plan to explicitly disallow automated updates of all
> nvidia-related drivers.
>
> Best regards,
> Marco