Mailing-List: contact dev-help@ignite.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@ignite.apache.org
MIME-Version: 1.0
In-Reply-To: <1082409251.2056128.1502806339267@mail.yahoo.com>
References: <796529346.1215093.1502712127413.ref@mail.yahoo.com>
 <796529346.1215093.1502712127413@mail.yahoo.com> <0A4551F2-289B-438F-AB4D-80FA4B4E2881@apache.org>
 <1082409251.2056128.1502806339267@mail.yahoo.com>
From: Alexey Kuznetsov <akuznetsov@apache.org>
Date: Tue, 15 Aug 2017 22:16:18 +0700
Message-ID: <CALH+G9ojBMMb13fNB+9JuK2y=ovWBbApK04NJmXJ-gKLWpDUbg@mail.gmail.com>
Subject: Re: Ignite not friendly for Monitoring
To: dev@ignite.apache.org, Alexey Kukushkin <alexeykukushkin@yahoo.com>
Content-Type: multipart/alternative; boundary="001a11434eb8e12c730556cc4243"
archived-at: Tue, 15 Aug 2017 15:16:25 -0000

--001a11434eb8e12c730556cc4243
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Alexey,

How are you going to deal with the distributed nature of an Ignite cluster?
And how do you propose to handle node restarts / stops?

On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin <
alexeykukushkin@yahoo.com.invalid> wrote:

> Hi Denis,
> Monitoring tools simply watch event logs for patterns (regex in case of
> unstructured logs like text files). A stable (not changing in new
> releases) event ID identifying a specific issue would be such a pattern.
> We need to introduce such event IDs according to the principles I
> described in my previous mail.
> Best regards, Alexey
>
>
> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda <
> dmagda@apache.org> wrote:
>
> Hello Alexey,
>
> Thanks for the detailed input.
>
> Assuming that Ignite supported the suggested event-based model, how can
> it be integrated with the mentioned tools like DynaTrace or Nagios? Is
> this all we need?
>
> --
> Denis
>
> > On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin
> > <alexeykukushkin@yahoo.com.INVALID> wrote:
> >
> > Igniters,
> > While preparing some Ignite materials for Administrators I found Ignite
> > is not friendly for such a critical DevOps practice as monitoring.
> > TL;DR: I think Ignite lacks structured descriptions of abnormal events
> > with references to event IDs in the logs that do not change as new
> > versions are released.
> > MORE DETAILS
> > I call an application “monitoring friendly” if it allows DevOps to:
> > 1. immediately receive a notification (email, SMS, etc.)
> > 2. understand what a problem is without involving developers
> > 3. provide automated recovery action.
> >
> > Large enterprises do not implement custom solutions. They usually use
> > tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the
> > enterprise consistently. All such tools have a similar architecture:
> > a dashboard showing apps as “green/yellow/red”, and numerous
> > “connectors” to look for events in text logs, ESBs, database tables, etc.
> >
> > For each app DevOps build a “health model” - a diagram displaying the
> > app’s “manageable” components and the app boundaries. A “manageable”
> > component is something that can be started/stopped/configured in
> > isolation. “System boundary” is a list of external apps that the
> > monitored app interacts with.
> >
> > The main attribute of a manageable component is a list of “operationally
> > significant events”. Those are the events that DevOps can do something
> > about. For example, “failed to connect to cache store” is significant,
> > while “user input validation failed” is not.
> >
> > Events shall be as specific as possible so that DevOps do not spend time
> > on further analysis. For example, a “database failure” event is not good.
> > There should be “database connection failure”, “invalid database schema”,
> > “database authentication failure”, etc. events.
> >
> > An “event” is NOT the same as an exception that occurred in the code.
> > Events identify a specific problem from the DevOps point of view. For
> > example, even if a “connection to cache store failed” exception might be
> > thrown from several places in the code, that is still the same event. On
> > the other hand, even if SqlServerConnectionTimeout and
> > OracleConnectionTimeout exceptions might be caught in the same place,
> > those are different events, since MS SQL Server and Oracle are usually
> > handled by different DevOps groups in large enterprises!
> >
> > The operationally significant event IDs must be stable: they must not
> > change from one release to another. This is like a contract between
> > developers and DevOps.
> >
> > It should be the developer’s responsibility to publish and maintain a
> > table with these attributes:
> >
> > - Event ID
> > - Severity: Critical (Red) - the system is not operational; Warning
> >   (Yellow) - the system is operational but health is degraded; None -
> >   informational only.
> > - Description: concise but enough for DevOps to act without the
> >   developer’s help
> > - Recovery actions: what DevOps shall do to fix the issue without the
> >   developer’s help. DevOps might create automated recovery scripts based
> >   on this information.
> >
> > For example:
> > 10100 - Critical - Could not connect to Zookeeper to discover nodes -
> > 1) Open the Ignite configuration and find the Zookeeper connection string
> > 2) Make sure Zookeeper is running
> > 10200 - Warning - Ignite node left the cluster.
> >
> > Back to Ignite: it looks to me like we do not design for operations as
> > described above. We have no event IDs: our logging is subject to change
> > in new versions, so any patterns DevOps might use to detect significant
> > events would stop working after an upgrade.
> >
> > If I am not the only one who has such concerns, then we might open a
> > ticket to address this.
> >
> >
> > Best regards, Alexey
>
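
For illustration only, here is a minimal Python sketch of the event table
quoted above and of how a monitoring connector might match it against a text
log. The two event IDs, severities and descriptions come from the quoted
example; everything else is an assumption made for this sketch and not an
existing Ignite convention: the "[EVT-<id>]" log tag, the EventSpec and
CATALOG names, and the classify() helper are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EventSpec:
    """One row of the published event table."""
    event_id: int
    severity: str      # "Critical", "Warning" or "None"
    description: str
    recovery: str      # what DevOps should do; empty if not specified


# Hypothetical catalog holding the two entries from the quoted example.
CATALOG = {
    10100: EventSpec(
        10100,
        "Critical",
        "Could not connect to Zookeeper to discover nodes",
        "1) Open the Ignite configuration and find the Zookeeper "
        "connection string 2) Make sure Zookeeper is running",
    ),
    # The quoted example gives no recovery action for this event.
    10200: EventSpec(10200, "Warning", "Ignite node left the cluster", ""),
}

# Assumed log convention: a significant line carries a stable "[EVT-<id>]" tag.
EVENT_TAG = re.compile(r"\[EVT-(\d+)\]")


def classify(line: str) -> Optional[EventSpec]:
    """Return the catalog entry for a log line, or None if the line has no
    event tag or the ID is not in the catalog."""
    match = EVENT_TAG.search(line)
    if match is None:
        return None
    return CATALOG.get(int(match.group(1)))


if __name__ == "__main__":
    # Made-up log lines; the surrounding message text is free to change
    # between releases as long as the tag stays the same.
    sample = [
        "2017-08-15 12:00:01 ERROR [EVT-10100] Zookeeper connection failed",
        "2017-08-15 12:00:05 INFO  Cache started",
        "2017-08-15 12:01:10 WARN  [EVT-10200] Node left topology",
    ]
    for line in sample:
        spec = classify(line)
        if spec is not None:
            print(spec.event_id, spec.severity, spec.description)
```

Because the pattern keys on the ID alone, rewording a log message in a new
release would not break the match, which is the stability property the quoted
proposal asks for.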


-- 
Alexey Kuznetsov

--001a11434eb8e12c730556cc4243--