From: Christofer Dutz
To: "infrastructure-dev@apache.org"
CC: "mirko.novakovic@codecentric.de"
Subject: AW: Tool proposal for helping run and monitor the ASF Infra Services
Date: Fri, 19 Aug 2016 06:20:51 +0000
In-Reply-To: <81BD2FEA-413E-489B-BCA6-6F5C79EC8178@apache.org>
Content-Type: multipart/alternative; boundary="_000_v18ets5rb2230ba1rueinn971471587650621emailandroidcom_"
MIME-Version: 1.0

--_000_v18ets5rb2230ba1rueinn971471587650621emailandroidcom_
Content-Type: text/plain; charset="Windows-1252"

Hi Chris,

I knew that someone asked exactly the "how does it compare to datadog" question somewhere.
Here's the link to that thread: https://news.ycombinator.com/item?id=12147219

And I can confirm the shortcomings of the time series approach, because in Jenkins, I'd say about 70% of recent failures of Flex builds were due to timeouts when uploading Maven artifacts to Nexus. The current solution doesn't seem to detect that. Not only that, I couldn't see any HipChat notifications. The infra guys always had to start looking for the real reason for the timeouts, as Nexus wasn't having any problems at all.

And I really love the feature of tracking down the response time for one service back to other servers to find the real reason for a system being slow (have a look at the presentation for this; there's a great slide on it).

Chris

Sent from my Samsung Galaxy smartphone.

-------- Original message --------
From: Chris Lambertus
Date: 19.08.16 07:03 (GMT+01:00)
To: infrastructure-dev@apache.org, Christofer Dutz
Cc: mirko.novakovic@codecentric.de
Subject: Re: Tool proposal for helping run and monitor the ASF Infra Services

Hiya Chris,

Thanks for the info and the legwork on this. We currently use DataDog, which is very similar to what Instana appears to provide: an agent-based monitoring solution that gives us that kind of look into our infra. We also have a number of internal tools that report on various goings-on as well. You might see some of this in #asfinfra on hipchat from SNMP2HipChat. DataDog also reports various problems there, as does our monitoring via PingMyBox.

Since you're not root@, you may not see some of the stuff that we see, but I think by and large, the majority of the monitors do direct to #asfinfra. Have you noticed gaps in the monitoring? Since we moved to DataDog, we've been quite happy with the resolution and metrics we've been able to get. It's been on my back burner for a while to expose some of our DD dashboards as public, but for right now it's somewhat limited access.
In the interests of transparency (but not at the expense of security), I'd be happy to work with you to expose more of this, and I'm happy to address any questions or concerns about shortcomings in our monitoring.

Many thanks to Instana for offering the ASF free services! I'd definitely like to hear more about what they might be able to offer on top of what we already get from DataDog. I'll take a look at the info you sent out. Please feel free to follow up with me directly, either via email or hipchat.

Cheers,
-Chris

> On Aug 18, 2016, at 2:13 AM, Christofer Dutz wrote:
>
> Hi,
>
> I have been on the Infra Hipchat for a few weeks now while trying to migrate the Flex project to Maven and back to the ASF Infra build system. Thanks for your support in this and even more thanks for the trust in granting me access and Admin rights on the windows1 build agent.
>
> In the chat I observed the daily work of you guys, having to maintain quite a zoo of all sorts of different systems on different platforms. Some problems you were having seem quite easy to track down ... if the hard disk is full, you clean up. But not all problems are that easy to track down. Thinking of the problems with repository.apache.org ... here the cause was the proxy being flooded with connections (I think this was the case) ... regular restarts of this helped temporarily, but I don't think that helps in the long term as no one had an idea why those connections were hanging there in the first place.
>
> A few years ago the company I work for, codecentric, founded a company called Instana. They are developing an agent-based system for monitoring IT infrastructure. In contrast to most established solutions, they use machine learning strategies to analyze the root cause of problems.
> While you can probably achieve similar results with normal tools, the problem is that you need very detailed domain knowledge to do so, and in a regularly changing environment you need to continuously keep adjusting your metrics. Instana does this automatically. I think you can imagine how tricky it is to follow the root cause for bad response times through a network of interconnected services.
>
> Investing almost all of my free time (and a lot of my paid time) in Apache, and noticing a lot of the problems you have to deal with every day, I asked Instana if they would be willing to provide their service to the ASF for free, and they agreed and immediately set up a dedicated instance.
>
> I wanted to try the thing out, as I would prefer to grab a few beers with you at ApacheCon in Seville and not get punched in the face for recommending something bad ;-) ... so I tried this on my private server playground. I unpacked and started the agent, and the host appeared on the web console and reported the problems it was having (ones I didn't even know about) as well as other systems it communicates with ... as soon as I added agents on these machines, the analytics started doing their work across systems and I built up a map view of my services and their correlation. So it's really a system that needs almost no configuration at all :-)
>
> I uploaded the internal product presentation here: https://public.centerdevice.de/1a9dc4ed-515e-482e-9fd6-6d60a5562598 (please don't share this outside of the ASF)
>
> Please use the password: 4p4cheR0cks (I'll remove that document in about two weeks)
>
> By the way ... the screenshots in the presentation are real ... I was amazed to see a 3D web UI in production for the first time ;-)
>
> So if there is any interest in this offer, I would be more than happy to provide credentials to you and assist you in getting started, so you could easily try it out.
> The guys at Instana would also be delighted to give you guys an online demo and answer any questions you might have. Feel free to contact Mirko directly for this: mirko.novakovic@codecentric.de
>
> Chris

--_000_v18ets5rb2230ba1rueinn971471587650621emailandroidcom_--