ignite-dev mailing list archives

From Alexey Kukushkin <alexeykukush...@yahoo.com.INVALID>
Subject Re: Ignite not friendly for Monitoring
Date Tue, 15 Aug 2017 17:56:37 GMT
Hi Alexey,
A nice thing about delegating alerting to 3rd-party enterprise systems is that those systems
already deal with many concerns, including distributed applications.
What is needed from Ignite is to write consistently to log files (again, that means stable
event IDs, proper event granularity, no repetition, and documentation). It would then be the 3rd-party
monitoring system's responsibility to monitor log files on all nodes and to filter, aggregate, process,
visualize and notify on events.
Here is how a monitoring tool would deal with an event like "node left".
The only thing needed from Ignite is to write an entry like the one below to the log files on all Ignite
servers. In this example, 3300 identifies the "node left" event and will never change in the
future, even if the text description changes:
[2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the cluster
Then we document somewhere on the web that Ignite has event 3300 and that it means a node left
the cluster, and maybe provide documentation on how to deal with it. Some examples:
Oracle Web Cache events: https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/event.htm#sthref2393
MS SQL Server events: https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
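Just to illustrate the idea, here is a minimal sketch of how a log-monitoring plugin could match such entries by their stable event ID. The log format and event 3300 are taken from the example above; the event table and function names are hypothetical, not any real monitoring tool's API:

```python
import re

# Matches the example log format: [timestamp] [LEVEL] <event id> <message>.
# Only the numeric event ID is treated as a stable contract; the text after
# it is free to change between releases.
LOG_LINE = re.compile(
    r"\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] (?P<event_id>\d+) (?P<message>.*)"
)

# Hypothetical event table built from the documentation described above.
EVENTS = {
    3300: ("WARNING", "Node left the cluster"),
}

def significant_events(lines):
    """Yield (event_id, severity, message) for documented events only."""
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        event_id = int(m.group("event_id"))
        if event_id in EVENTS:
            severity, _ = EVENTS[event_id]
            yield event_id, severity, m.group("message")

sample = "[2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the cluster"
print(list(significant_events([sample])))
```

Note that the plugin keys off the event ID alone, so the human-readable message can evolve without breaking anyone's monitoring configuration.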
That is all for Ignite! Everything else is handled by the specific monitoring system configured
by DevOps on the customer side.
Based on Ignite documentation similar to the above, the DevOps team of a company where Ignite is going
to be used will configure their monitoring system to understand Ignite events. Consider the
"node left" event as an example.
- This event is output on every node, but DevOps do not want to be notified many times. To
address this, they will build an "Ignite model" with a parent-child dependency
between the components "Ignite Cluster" and "Ignite Node". For example, this is how you do it
in Nagios: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/dependencies.html and
this is how you do it in Microsoft SCSM: https://docs.microsoft.com/en-us/system-center/scsm/auth-classes. Then
DevOps will configure "node left" monitors in SCSM (or "checks" in Nagios) for the parent "Ignite
Cluster" and child "Ignite Node" components. The state change (OK -> WARNING) and notification
(email, SMS, whatever) will be configured only for the "Ignite Cluster"'s "node left" monitor.
- Now suppose a node leaves. The "node left" monitor (which uses a log file monitoring plugin) on
the "Ignite Node" will detect the event and pass it to the parent. This will trigger an "Ignite Cluster"
state change from OK to WARNING and send a notification. No further notification will be sent
until the "Ignite Cluster" state is reset back to OK, which happens either manually, on
timeout, or automatically on "node joined".
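To make the parent-child suppression concrete, here is a small sketch of the behavior described above. It is not tied to any real Nagios or SCSM API; the class and method names are made up for illustration, and 3300 is the stable "node left" event ID from the earlier example:

```python
class ClusterMonitor:
    """Hypothetical parent monitor: notifies once per OK -> WARNING transition."""

    def __init__(self):
        self.state = "OK"
        self.notifications = []

    def on_child_event(self, event_id, message):
        # Child "Ignite Node" monitors forward detected events here.
        if event_id == 3300 and self.state == "OK":
            self.state = "WARNING"
            self.notifications.append(f"WARNING: {message}")
        # Further 3300 events while already in WARNING are suppressed.

    def reset(self):
        # Done manually, on timeout, or automatically on "node joined".
        self.state = "OK"

cluster = ClusterMonitor()
# The same event appears in every node's log, but only one notification goes out.
for node in ("node-1", "node-2", "node-3"):
    cluster.on_child_event(3300, f"Node DF2345F left the cluster (seen on {node})")
print(cluster.state, len(cluster.notifications))  # WARNING 1
```

The key design point is that state and notification live only on the parent "Ignite Cluster" component, so duplicate events from many nodes collapse into a single alert.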
This was just FYI. We, the Ignite developers, do not need to care about how monitoring works - that is
the responsibility of the customer's DevOps. Our responsibility is consistent event logging.
Thank you!


Best regards, Alexey


On Tuesday, August 15, 2017, 6:16:25 PM GMT+3, Alexey Kuznetsov <akuznetsov@apache.org>
wrote:

Alexey,

How are you going to deal with the distributed nature of an Ignite cluster?
And how do you propose to handle node restarts / stops?

On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin <
alexeykukushkin@yahoo.com.invalid> wrote:

> Hi Denis,
> Monitoring tools simply watch event logs for patterns (regexes in the case of
> unstructured logs like text files). A stable (not changing in new releases)
> event ID identifying a specific issue would be such a pattern.
> We need to introduce such event IDs according to the principles I
> described in my previous mail.
> Best regards, Alexey
>
>
> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda <
> dmagda@apache.org> wrote:
>
> Hello Alexey,
>
> Thanks for the detailed input.
>
> Assuming that Ignite supported the suggested event-based model, how can
> it be integrated with the mentioned tools like DynaTrace or Nagios? Is this all
> we need?
>
> —
> Denis
>
> > On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin <alexeykukushkin@yahoo.com
> .INVALID> wrote:
> >
> > Igniters,
> > While preparing some Ignite materials for Administrators I found Ignite
> is not friendly for such a critical DevOps practice as monitoring.
> > TL;DR: I think Ignite misses structured descriptions of abnormal events,
> with references to event IDs in the logs that do not change as new versions
> are released.
> > MORE DETAILS
> > I call an application “monitoring friendly” if it allows DevOps to:
> > 1. immediately receive a notification (email, SMS, etc.)
> > 2. understand what a problem is without involving developers
> > 3. provide automated recovery action.
> >
> > Large enterprises do not implement custom solutions. They usually use
> tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the
> enterprise consistently. All such tools have similar architecture providing
> a dashboard showing apps as “green/yellow/red”, and numerous “connectors”
> to look for events in text logs, ESBs, database tables, etc.
> >
> > For each app DevOps build a “health model” - a diagram displaying the
> app’s “manageable” components and the app boundaries. A “manageable”
> component is something that can be started/stopped/configured in isolation.
> “System boundary” is a list of external apps that the monitored app
> interacts with.
> >
> > The main attribute of a manageable component is a list of “operationally
> significant events”. Those are the events that DevOps can do something
> with. For example, “failed to connect to cache store” is significant, while
> “user input validation failed” is not.
> >
> > Events shall be as specific as possible so that DevOps do not spend time
> on further analysis. For example, a “database failure” event is not good.
> There should be “database connection failure”, “invalid database schema”,
> “database authentication failure”, etc. events.
> >
> > An “event” is NOT the same as an exception occurring in the code. Events
> identify a specific problem from the DevOps point of view. For example, even
> if a “connection to cache store failed” exception might be thrown from
> several places in the code, it is still the same event. On the other
> hand, even if SqlServerConnectionTimeout and OracleConnectionTimeout
> exceptions might be caught in the same place, those are different events,
> since MS SQL Server and Oracle are usually handled by different DevOps groups
> in large enterprises!
> >
> > The operationally significant event IDs must be stable: they must not
> change from one release to another. This is like a contract between
> developers and DevOps.
> >
> > It should be the developer’s responsibility to publish and maintain a
> table with the following attributes:
> >
> > - Event ID
> > - Severity: Critical (Red) - the system is not operational; Warning
> (Yellow) - the system is operational but health is degraded; None - just an
> info.
> > - Description: concise but enough for DevOps to act without developer’s
> help
> > - Recovery actions: what DevOps shall do to fix the issue without
> developer’s help. DevOps might create automated recovery scripts based on
> this information.
> >
> > For example:
> > 10100 - Critical - Could not connect to ZooKeeper to discover nodes -
> 1) Open the Ignite configuration and find the ZooKeeper connection string 2) Make
> sure ZooKeeper is running
> > 10200 - Warning - Ignite node left the cluster.
> >
> > Back to Ignite: it looks to me like we do not design for operations as
> described above. We have no event IDs: our logging is subject to change in
> new versions, so any patterns DevOps might use to detect significant
> events would stop working after an upgrade.
> >
> > If I am not the only one who has such concerns, then we might open a
> ticket to address this.
> >
> >
> > Best regards, Alexey
>



-- 
Alexey Kuznetsov