From: Alexey Kuznetsov
Date: Tue, 15 Aug 2017 22:16:18 +0700
Subject: Re: Ignite not friendly for Monitoring
To: dev@ignite.apache.org, Alexey Kukushkin

Alexey,

How are you going to deal with the distributed nature of an Ignite cluster? And how do you propose to handle node restarts / stops?

On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin <alexeykukushkin@yahoo.com.invalid> wrote:

> Hi Denis,
> Monitoring tools simply watch event logs for patterns (regex in case of unstructured logs like text files). A stable (not changing in new releases) event ID identifying a specific issue would be such a pattern.
> We need to introduce such event IDs according to the principles I described in my previous mail.
> Best regards, Alexey
>
> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda <dmagda@apache.org> wrote:
>
> Hello Alexey,
>
> Thanks for the detailed input.
>
> Assuming that Ignite supported the suggested event-based model, how can it be integrated with the mentioned tools like DynaTrace or Nagios? Is this all we need?
>
> --
> Denis
>
> > On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin wrote:
> >
> > Igniters,
> > While preparing some Ignite materials for Administrators I found Ignite is not friendly for such a critical DevOps practice as monitoring.
> > TL;DR: I think Ignite misses structured descriptions of abnormal events with references to event IDs in the logs that do not change as new versions are released.
> > MORE DETAILS
> > I call an application “monitoring friendly” if it allows DevOps to:
> > 1. immediately receive a notification (email, SMS, etc.)
> > 2. understand what a problem is without involving developers
> > 3. provide automated recovery action.
> >
> > Large enterprises do not implement custom solutions. They usually use tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the enterprise consistently. All such tools have a similar architecture, providing a dashboard showing apps as “green/yellow/red” and numerous “connectors” to look for events in text logs, ESBs, database tables, etc.
> >
> > For each app DevOps build a “health model” - a diagram displaying the app’s “manageable” components and the app boundaries. A “manageable” component is something that can be started/stopped/configured in isolation. “System boundary” is a list of external apps that the monitored app interacts with.
> >
> > The main attribute of a manageable component is a list of “operationally significant events”. Those are the events that DevOps can do something with. For example, “failed to connect to cache store” is significant, while “user input validation failed” is not.
> >
> > Events shall be as specific as possible so that DevOps do not spend time on further analysis. For example, a “database failure” event is not good. There should be “database connection failure”, “invalid database schema”, “database authentication failure”, etc. events.
> >
> > “Event” is NOT the same as an exception that occurred in the code. Events identify a specific problem from the DevOps point of view. For example, even if a “connection to cache store failed” exception might be thrown from several places in the code, that is still the same event.
> > On the other hand, even if SqlServerConnectionTimeout and OracleConnectionTimeout exceptions might be caught in the same place, those are different events, since MS SQL Server and Oracle are usually handled by different DevOps groups in large enterprises!
> >
> > The operationally significant event IDs must be stable: they must not change from one release to another. This is like a contract between developers and DevOps.
> >
> > It should be the developer’s responsibility to publish and maintain a table with attributes:
> >
> > - Event ID
> > - Severity: Critical (Red) - the system is not operational; Warning (Yellow) - the system is operational but health is degraded; None - just an info.
> > - Description: concise but enough for DevOps to act without developer’s help
> > - Recovery actions: what DevOps shall do to fix the issue without developer’s help. DevOps might create automated recovery scripts based on this information.
> >
> > For example:
> > 10100 - Critical - Could not connect to Zookeeper to discover nodes - 1) Open the Ignite configuration and find the Zookeeper connection string 2) Make sure Zookeeper is running
> > 10200 - Warning - Ignite node left the cluster.
> >
> > Back to Ignite: it looks to me like we do not design for operations as described above. We have no event IDs: our logging is subject to change in new versions, so any patterns DevOps might use to detect significant events would stop working after an upgrade.
> >
> > If I am not the only one who has such concerns then we might open a ticket to address this.
> >
> > Best regards, Alexey
>

-- 
Alexey Kuznetsov
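
For reference, the event-ID contract described in the quoted message could be sketched roughly as below. This is only an illustration in Python under stated assumptions: the names `OperationalEvent`, `CATALOG`, `format_log_line`, `match_event` and the `[EVT-<id>]` log prefix are hypothetical and are not part of Ignite or of any monitoring tool. Only the two event IDs (10100, 10200) and their severities and descriptions come from the thread. The sketch does not address cluster-wide aggregation or node restarts.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"  # the system is not operational
    WARNING = "Warning"    # operational, but health is degraded
    NONE = "None"          # informational only


@dataclass(frozen=True)
class OperationalEvent:
    event_id: int          # must stay stable from release to release
    severity: Severity
    description: str
    recovery: str = ""     # what DevOps can do without developer help


# Hypothetical published catalog; the two entries are the examples from the thread.
CATALOG = {
    e.event_id: e
    for e in (
        OperationalEvent(
            10100,
            Severity.CRITICAL,
            "Could not connect to Zookeeper to discover nodes",
            "Check the Zookeeper connection string; make sure Zookeeper is running",
        ),
        OperationalEvent(10200, Severity.WARNING, "Ignite node left the cluster"),
    )
}


def format_log_line(event_id: int, detail: str = "") -> str:
    """Producer side: emit a log line carrying the stable event ID (assumed format)."""
    event = CATALOG[event_id]
    line = f"[EVT-{event.event_id}] [{event.severity.value}] {event.description}"
    return f"{line}: {detail}" if detail else line


# Monitoring side: the pattern depends only on the ID, not on the message wording.
EVENT_PATTERN = re.compile(r"\[EVT-(\d+)\]")


def match_event(log_line: str):
    """Return the catalog entry referenced by a log line, or None if there is none."""
    found = EVENT_PATTERN.search(log_line)
    if found is None:
        return None
    return CATALOG.get(int(found.group(1)))
```

Because the monitoring side matches only on the ID, the human-readable description can be reworded between releases without breaking existing patterns, which is the "contract" property the original message asks for.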