ignite-dev mailing list archives

From Alexey Kuznetsov <akuznet...@apache.org>
Subject Re: Ignite not friendly for Monitoring
Date Tue, 15 Aug 2017 15:16:18 GMT
Alexey,

How are you going to deal with the distributed nature of an Ignite cluster?
And how do you propose to handle node restarts / stops?

On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin <
alexeykukushkin@yahoo.com.invalid> wrote:

> Hi Denis,
> Monitoring tools simply watch event logs for patterns (regex in case of
> unstructured logs like text files). A stable (not changing in new releases)
> event ID identifying a specific issue would be such a pattern.
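>
> For illustration, here is a minimal sketch in Java of how a monitoring
> connector could match such a stable event ID in a text log. The
> “[IGNITE-<eventId>]” log format used here is a hypothetical example, not
> an existing Ignite convention:
>
>     import java.util.regex.Matcher;
>     import java.util.regex.Pattern;
>
>     public class EventIdPatternDemo {
>         // Hypothetical log format: "<timestamp> [IGNITE-<id>] <message>".
>         private static final Pattern EVENT_ID =
>             Pattern.compile("\\[IGNITE-(\\d{5})\\]");
>
>         public static void main(String[] args) {
>             String line = "2017-08-15 15:16:18 [IGNITE-10100] "
>                 + "Could not connect to ZooKeeper";
>
>             Matcher m = EVENT_ID.matcher(line);
>             if (m.find())
>                 System.out.println("Matched event ID: " + m.group(1)); // 10100
>         }
>     }
>
> Because the ID is stable, the same pattern keeps working after an upgrade
> even if the message text itself changes.
>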
> We need to introduce such event IDs according to the principles I
> described in my previous mail.
> Best regards, Alexey
>
>
> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda <
> dmagda@apache.org> wrote:
>
> Hello Alexey,
>
> Thanks for the detailed input.
>
> Assuming that Ignite supported the suggested event-based model, how could
> it be integrated with the mentioned tools like DynaTrace or Nagios? Is
> this all we need?
>
> —
> Denis
>
> > On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin <alexeykukushkin@yahoo.com
> .INVALID> wrote:
> >
> > Igniters,
> > While preparing some Ignite materials for administrators, I found that
> Ignite is not friendly to such a critical DevOps practice as monitoring.
> > TL;DR: I think Ignite lacks structured descriptions of abnormal events,
> with references to event IDs in the logs that do not change as new versions
> are released.
> > MORE DETAILS
> > I call an application “monitoring friendly” if it allows DevOps to:
> > 1. immediately receive a notification (email, SMS, etc.)
> > 2. understand what a problem is without involving developers
> > 3. provide automated recovery action.
> >
> > Large enterprises do not implement custom solutions. They usually use
> tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the
> enterprise consistently. All such tools have a similar architecture, providing
> a dashboard showing apps as “green/yellow/red”, and numerous “connectors”
> to look for events in text logs, ESBs, database tables, etc.
> >
> > For each app DevOps build a “health model” - a diagram displaying the
> app’s “manageable” components and the app boundaries. A “manageable”
> component is something that can be started/stopped/configured in isolation.
> “System boundary” is a list of external apps that the monitored app
> interacts with.
> >
> > The main attribute of a manageable component is a list of “operationally
> significant events”. Those are the events that DevOps can do something
> with. For example, “failed to connect to cache store” is significant, while
> “user input validation failed” is not.
> >
> > Events shall be as specific as possible so that DevOps do not spend time
> on further analysis. For example, a “database failure” event is too generic.
> There should be “database connection failure”, “invalid database schema”,
> “database authentication failure”, etc. events.
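> >
> > As a sketch, such specific events could be published as constants; the
> > IDs and names below are hypothetical examples only:
> >
> >     // Specific events instead of one generic "database failure" event.
> >     public enum DbEvent {
> >         DB_CONNECTION_FAILURE(10401),
> >         DB_INVALID_SCHEMA(10402),
> >         DB_AUTHENTICATION_FAILURE(10403);
> >
> >         public final int id;
> >
> >         DbEvent(int id) { this.id = id; }
> >     }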
> >
> > An “event” is NOT the same as an exception occurring in the code. Events
> identify a specific problem from the DevOps point of view. For example, even
> if a “connection to cache store failed” exception might be thrown from
> several places in the code, it is still the same event. On the other
> hand, even if SqlServerConnectionTimeout and OracleConnectionTimeout
> exceptions are caught in the same place, those are different events, since
> MS SQL Server and Oracle are usually handled by different DevOps groups in
> large enterprises!
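> >
> > A minimal sketch of that mapping in Java; the exception class names and
> > event IDs are made up for illustration only:
> >
> >     public class EventMapping {
> >         // Different exception types caught in the same place still map to
> >         // different events, because MS SQL Server and Oracle are usually
> >         // owned by different DevOps groups.
> >         static int toEventId(Exception e) {
> >             switch (e.getClass().getSimpleName()) {
> >                 case "SqlServerConnectionTimeout": return 10301; // MS SQL
> >                 case "OracleConnectionTimeout":    return 10302; // Oracle
> >                 default:                           return 10000; // generic
> >             }
> >         }
> >     }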
> >
> > The operationally significant event IDs must be stable: they must not
> change from one release to another. This is like a contract between
> developers and DevOps.
> >
> > It should be the developer’s responsibility to publish and maintain a
> table with the following attributes:
> >
> > - Event ID
> > - Severity: Critical (Red) - the system is not operational; Warning
> (Yellow) - the system is operational but its health is degraded; None - just
> informational.
> > - Description: concise but sufficient for DevOps to act without the
> developer’s help
> > - Recovery actions: what DevOps shall do to fix the issue without the
> developer’s help. DevOps might create automated recovery scripts based on
> this information.
> >
> > For example:
> > 10100 - Critical - Could not connect to ZooKeeper to discover nodes -
> 1) Open the Ignite configuration and find the ZooKeeper connection string;
> 2) Make sure ZooKeeper is running
> > 10200 - Warning - Ignite node left the cluster.
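> >
> > As a sketch, the two events above could be published straight from the
> > code, e.g. as a Java enum; the IDs, texts and the enum itself are
> > hypothetical examples, not an actual Ignite catalog:
> >
> >     public enum OpsEvent {
> >         ZOOKEEPER_UNREACHABLE(10100, Severity.CRITICAL,
> >             "Could not connect to ZooKeeper to discover nodes",
> >             "Check the ZooKeeper connection string in the Ignite "
> >                 + "configuration; make sure ZooKeeper is running"),
> >         NODE_LEFT(10200, Severity.WARNING,
> >             "Ignite node left the cluster",
> >             "Check the node process and network; restart if needed");
> >
> >         public enum Severity { CRITICAL, WARNING, NONE }
> >
> >         public final int id;
> >         public final Severity severity;
> >         public final String description;
> >         public final String recovery;
> >
> >         OpsEvent(int id, Severity severity, String desc, String recovery) {
> >             this.id = id;
> >             this.severity = severity;
> >             this.description = desc;
> >             this.recovery = recovery;
> >         }
> >     }
> >
> > A monitoring tool would then match the stable numeric ID in the log (as
> > in the regex sketch earlier) regardless of the release.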
> >
> > Back to Ignite: it looks to me that we do not design for operations as
> described above. We have no event IDs: our logging is subject to change in
> new versions, so any patterns DevOps might use to detect significant
> events could stop working after an upgrade.
> >
> > If I am not the only one who has such concerns, then we might open a
> ticket to address this.
> >
> >
> > Best regards, Alexey
>



-- 
Alexey Kuznetsov
