ignite-dev mailing list archives

From Vladimir Ozerov <voze...@gridgain.com>
Subject Re: Ignite not friendly for Monitoring
Date Mon, 28 Aug 2017 08:22:38 GMT
IGNITE-5620 is about error codes thrown from drivers. This is a completely
different story, as every driver has a specification with its own specific
error codes. There is no common denominator.

On Thu, Aug 17, 2017 at 11:10 PM, Denis Magda <dmagda@apache.org> wrote:

> Vladimir,
>
> I would disagree. In IGNITE-5620 we’re going to introduce some constant
> error codes and prepare a sheet that will elaborate on every error. That’s
> part of a bigger endeavor in which the whole platform should be covered by
> special unique IDs for errors, warnings and events.
>
> Now, we need to agree at least on the IDs range for SQL.
>
> —
> Denis
>
> > On Aug 15, 2017, at 11:10 PM, Vladimir Ozerov <vozerov@gridgain.com>
> wrote:
> >
> > Denis,
> >
> > IGNITE-5620 is a completely different thing. Let's not mix cluster
> > monitoring and parser errors.
> >
> > On Wed, Aug 16, 2017 at 2:57, Denis Magda <dmagda@apache.org> wrote:
> >
> >> Alexey,
> >>
> >> Didn’t know that such an improvement as consistent IDs for errors and
> >> events can be used as an integration point with the DevOps tools. Thanks
> >> for sharing your experience with us.
> >>
> >> Would you step in as an architect for this task and file a JIRA ticket
> >> with all the required information?
> >>
> >> In general, we’ve already planned to do something around this starting
> >> with SQL:
> >> https://issues.apache.org/jira/browse/IGNITE-5620
> >>
> >> It makes sense to consider your input before the work on IGNITE-5620 is
> >> started.
> >>
> >> —
> >> Denis
> >>
> >>> On Aug 15, 2017, at 10:56 AM, Alexey Kukushkin <
> >> alexeykukushkin@yahoo.com.INVALID> wrote:
> >>>
> >>> Hi Alexey,
> >>> A nice thing about delegating alerting to 3rd party enterprise systems
> >> is that those systems already deal with lots of things including
> >> distributed apps.
> >>> What is needed from Ignite is to write to log files consistently (again,
> >>> that means stable event IDs, proper event granularity, no repetition,
> >>> documentation). It would be the 3rd party monitoring system's
> >>> responsibility to monitor log files on all nodes and to filter,
> >>> aggregate, process, visualize and notify on events.
> >>> How a monitoring tool would deal with an event like "node left":
> >>> The only thing needed from Ignite is to write an entry like below to
> log
> >> files on all Ignite servers. In this example 3300 identifies this "node
> >> left" event and will never change in the future even if text description
> >> changes:
> >>> [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the
> >> cluster
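
To illustrate how a log-watching connector could key on the stable ID rather than on the message text, here is a minimal Python sketch. It assumes the entry layout of the sample line above ([timestamp] [SEVERITY] ID text); the names LOG_ENTRY, LogEvent and parse_entry are made up for this example and are not part of Ignite or of any monitoring tool.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, taken from the sample entry above:
# [timestamp] [SEVERITY] <numeric event ID> <free-form text>
LOG_ENTRY = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+"
    r"\[(?P<severity>[A-Z]+)\]\s+"
    r"(?P<event_id>\d+)\s+"
    r"(?P<text>.*)$"
)


class LogEvent(NamedTuple):
    timestamp: str
    severity: str
    event_id: int
    text: str


def parse_entry(line: str) -> Optional[LogEvent]:
    """Return the parsed event, or None if the line does not match the layout."""
    match = LOG_ENTRY.match(line.strip())
    if match is None:
        return None
    return LogEvent(
        timestamp=match["timestamp"],
        severity=match["severity"],
        event_id=int(match["event_id"]),
        text=match["text"],
    )


NODE_LEFT = 3300  # the event ID used in the sample entry above

entry = parse_entry(
    "[2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the cluster"
)
# Same event with a reworded description: the ID still identifies it.
reworded = parse_entry(
    "[2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH has departed"
)
```

Because the check is on event_id and not on the text, rewording the description in a later release would not break the pattern.
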
> >>> Then we document somewhere on the web that Ignite has event 3300 and
> >>> that it means a node left the cluster. Maybe provide documentation on
> >>> how to deal with it. Some examples:
> >>> Oracle Web Cache events:
> >>> https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/event.htm#sthref2393
> >>> MS SQL Server events:
> >>> https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
> >>> That is all for Ignite! Everything else is handled by specific
> >> monitoring system configured by DevOps on the customer side.
> >>> Based on Ignite documentation similar to the above, the DevOps team of a
> >>> company where Ignite is going to be used will configure their monitoring
> >>> system to understand Ignite events. Consider the "node left" event as an
> >>> example.
> >>> - This event is output on every node, but DevOps do not want to be
> >>> notified many times. To address this, they will build an "Ignite model"
> >>> with a parent-child dependency between the components "Ignite Cluster"
> >>> and "Ignite Node". For example, this is how you do it in Nagios:
> >>> https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/dependencies.html
> >>> and this is how you do it in Microsoft SCSM:
> >>> https://docs.microsoft.com/en-us/system-center/scsm/auth-classes
> >>> Then DevOps will configure "node left" monitors in SCSM (or "checks" in
> >>> Nagios) for the parent "Ignite Cluster" and child "Ignite Node"
> >>> components. State change (OK -> WARNING) and notification (email, SMS,
> >>> whatever) will be configured only for the "Ignite Cluster"'s "node left"
> >>> monitor.
> >>> - Now suppose a node left. The "node left" monitor (which uses a log file
> >>> monitoring plugin) on "Ignite Node" will detect the event and pass it to
> >>> the parent. This will trigger the "Ignite Cluster" state change from OK
> >>> to WARNING and send a notification. No more notifications will be sent
> >>> until the "Ignite Cluster" state is reset back to OK, which happens
> >>> manually, on timeout, or automatically on "node joined".
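
The parent-child behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Nagios or SCSM implement it: the classes ClusterMonitor and NodeMonitor are invented for the example, 3300 is the "node left" ID from the sample log entry, and 3301 for "node joined" is a hypothetical ID assumed here.

```python
from typing import Callable, List

NODE_LEFT = 3300    # event ID from the sample log entry in this thread
NODE_JOINED = 3301  # hypothetical ID, assumed only for this sketch


class ClusterMonitor:
    """Parent component: aggregates events reported by node-level monitors."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        self.state = "OK"
        self._notify = notify

    def on_event(self, event_id: int, node: str) -> None:
        if event_id == NODE_LEFT and self.state == "OK":
            # The first report flips the state and sends the only notification.
            self.state = "WARNING"
            self._notify(f"Node {node} left the cluster")
        elif event_id == NODE_JOINED:
            # Automatic reset; a manual or timeout reset would call reset().
            self.reset()
        # Further NODE_LEFT reports while in WARNING are deliberately ignored.

    def reset(self) -> None:
        self.state = "OK"


class NodeMonitor:
    """Child component: watches one node's log and forwards events upward."""

    def __init__(self, parent: ClusterMonitor) -> None:
        self._parent = parent

    def on_log_event(self, event_id: int, node: str) -> None:
        self._parent.on_event(event_id, node)


sent: List[str] = []
cluster = ClusterMonitor(sent.append)
monitors = [NodeMonitor(cluster) for _ in range(3)]

# Every remaining node logs the same "node left" event.
for monitor in monitors:
    monitor.on_log_event(NODE_LEFT, "DF2345F-XCVDS4-34ETJH")
```

Three nodes report the same departure, but only the first report changes the cluster state, so a single notification is recorded in sent.
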
> >>> This was just FYI. We, Ignite developers, do not care about how
> >>> monitoring works - this is the responsibility of the customer's DevOps.
> >>> Our responsibility is consistent event logging.
> >>> Thank you!
> >>>
> >>>
> >>> Best regards, Alexey
> >>>
> >>>
> >>> On Tuesday, August 15, 2017, 6:16:25 PM GMT+3, Alexey Kuznetsov <
> >> akuznetsov@apache.org> wrote:
> >>>
> >>> Alexey,
> >>>
> >>> How are you going to deal with the distributed nature of an Ignite cluster?
> >>> And how do you propose to handle node restarts / stops?
> >>>
> >>> On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin <
> >>> alexeykukushkin@yahoo.com.invalid> wrote:
> >>>
> >>>> Hi Denis,
> >>>> Monitoring tools simply watch event logs for patterns (regex in case
> of
> >>>> unstructured logs like text files). A stable (not changing in new
> >> releases)
> >>>> event ID identifying specific issue would be such a pattern.
> >>>> We need to introduce such event IDs according to the principles I
> >>>> described in my previous mail.
> >>>> Best regards, Alexey
> >>>>
> >>>>
> >>>> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda <
> >>>> dmagda@apache.org> wrote:
> >>>>
> >>>> Hello Alexey,
> >>>>
> >>>> Thanks for the detailed input.
> >>>>
> >>>> Assuming that Ignite supported the suggested events based model, how
> can
> >>>> it be integrated with mentioned tools like DynaTrace or Nagios? Is
> this
> >> all
> >>>> we need?
> >>>>
> >>>> —
> >>>> Denis
> >>>>
> >>>>> On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin <
> >> alexeykukushkin@yahoo.com
> >>>> .INVALID> wrote:
> >>>>>
> >>>>> Igniters,
> >>>>> While preparing some Ignite materials for Administrators I found that
> >>>>> Ignite is not friendly for such a critical DevOps practice as monitoring.
> >>>>> TL;DR: I think Ignite lacks structured descriptions of abnormal events
> >>>>> with references to event IDs in the logs that do not change as new
> >>>>> versions are released.
> >>>>> MORE DETAILS
> >>>>> I call an application “monitoring friendly” if it allows DevOps to:
> >>>>> 1. immediately receive a notification (email, SMS, etc.)
> >>>>> 2. understand what a problem is without involving developers
> >>>>> 3. provide automated recovery action.
> >>>>>
> >>>>> Large enterprises do not implement custom solutions. They usually use
> >>>>> tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the
> >>>>> enterprise consistently. All such tools have a similar architecture,
> >>>>> providing a dashboard showing apps as “green/yellow/red”, and numerous
> >>>>> “connectors” to look for events in text logs, ESBs, database tables, etc.
> >>>>>
> >>>>> For each app DevOps build a “health model” - a diagram displaying the
> >>>>> app’s “manageable” components and the app boundaries. A “manageable”
> >>>>> component is something that can be started/stopped/configured in
> >>>>> isolation. “System boundary” is a list of external apps that the
> >>>>> monitored app interacts with.
> >>>>>
> >>>>> The main attribute of a manageable component is a list of
> >> “operationally
> >>>> significant events”. Those are the events that DevOps can do something
> >>>> with. For example, “failed to connect to cache store” is significant,
> >> while
> >>>> “user input validation failed” is not.
> >>>>>
> >>>>> Events shall be as specific as possible so that DevOps do not spend
> >>>>> time on further analysis. For example, a “database failure” event is
> >>>>> not good. There should be “database connection failure”, “invalid
> >>>>> database schema”, “database authentication failure”, etc. events.
> >>>>>
> >>>>> An “event” is NOT the same as an exception that occurred in the code.
> >>>>> Events identify a specific problem from the DevOps point of view. For
> >>>>> example, even if a “connection to cache store failed” exception might be
> >>>>> thrown from several places in the code, that is still the same event. On
> >>>>> the other hand, even if SqlServerConnectionTimeout and
> >>>>> OracleConnectionTimeout exceptions might be caught in the same place,
> >>>>> those are different events, since MS SQL Server and Oracle are usually
> >>>>> handled by different DevOps groups in large enterprises!
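
The distinction between exceptions and events can be made concrete with a small Python sketch. Everything in it is assumed for illustration: the exception classes are defined locally (only the two timeout names come from the message above), and the event IDs 20100, 20200 and 20300 are hypothetical.

```python
from typing import Callable, Dict, Optional, Type


class CacheStoreConnectionError(Exception):
    """Hypothetical exception that may be raised from several code paths."""


class SqlServerConnectionTimeout(Exception):
    """Name taken from the message above; defined here only for the sketch."""


class OracleConnectionTimeout(Exception):
    """Name taken from the message above; defined here only for the sketch."""


# Hypothetical event IDs: one ID per operational problem, not per call site.
EVENT_ID_BY_EXCEPTION: Dict[Type[BaseException], int] = {
    CacheStoreConnectionError: 20100,
    SqlServerConnectionTimeout: 20200,
    OracleConnectionTimeout: 20300,
}


def event_id_for(exc: BaseException) -> Optional[int]:
    """Map an exception to its event ID, honouring subclass relationships."""
    for cls in type(exc).__mro__:
        if cls in EVENT_ID_BY_EXCEPTION:
            return EVENT_ID_BY_EXCEPTION[cls]
    return None  # not an operationally significant event


def load_entry() -> None:
    raise CacheStoreConnectionError("read failed")


def write_entry() -> None:
    raise CacheStoreConnectionError("write failed")


def event_raised_by(action: Callable[[], None]) -> Optional[int]:
    """Run an action and report which event, if any, its failure maps to."""
    try:
        action()
    except Exception as exc:
        return event_id_for(exc)
    return None
```

Two different call sites (load_entry and write_entry) yield the same event ID, while the two timeout exceptions map to different IDs even though a single except clause could catch both.
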
> >>>>>
> >>>>> The operationally significant event IDs must be stable: they must not
> >>>>> change from one release to another. This is like a contract between
> >>>>> developers and DevOps.
> >>>>>
> >>>>> It should be the developers’ responsibility to publish and maintain a
> >>>>> table with these attributes:
> >>>>>
> >>>>> - Event ID
> >>>>> - Severity: Critical (Red) - the system is not operational; Warning
> >>>> (Yellow) - the system is operational but health is degraded; None -
> >> just an
> >>>> info.
> >>>>> - Description: concise but enough for DevOps to act without
> developer’s
> >>>> help
> >>>>> - Recovery actions: what DevOps shall do to fix the issue without
> >>>> developer’s help. DevOps might create automated recovery scripts based
> >> on
> >>>> this information.
> >>>>>
> >>>>> For example:
> >>>>> 10100 - Critical - Could not connect to Zookeeper to discover nodes -
> >>>>> 1) Open the Ignite configuration and find the Zookeeper connection
> >>>>> string 2) Make sure Zookeeper is running
> >>>>> 10200 - Warning - Ignite node left the cluster.
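
One way such a table could be kept next to the code is as a small typed catalog, sketched below in Python. The two entries are the examples from the message above; the names Severity, EventSpec, CATALOG and format_entry are assumptions made for this sketch, not an existing Ignite API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Tuple


class Severity(Enum):
    CRITICAL = "Critical"  # the system is not operational
    WARNING = "Warning"    # operational, but health is degraded
    NONE = "None"          # informational only


@dataclass(frozen=True)
class EventSpec:
    event_id: int
    severity: Severity
    description: str
    recovery_actions: Tuple[str, ...] = ()


# The two example rows from the message above.
CATALOG: Dict[int, EventSpec] = {
    spec.event_id: spec
    for spec in (
        EventSpec(
            10100,
            Severity.CRITICAL,
            "Could not connect to Zookeeper to discover nodes",
            (
                "Open the Ignite configuration and find the Zookeeper connection string",
                "Make sure Zookeeper is running",
            ),
        ),
        EventSpec(10200, Severity.WARNING, "Ignite node left the cluster"),
    )
}


def format_entry(event_id: int, detail: str = "") -> str:
    """Render the stable part of a log entry: severity, ID, description."""
    spec = CATALOG[event_id]
    suffix = f": {detail}" if detail else ""
    return f"[{spec.severity.value.upper()}] {spec.event_id} {spec.description}{suffix}"
```

Keeping the catalog as data means the published documentation table and the log output can be generated from the same source, which helps keep the IDs stable across releases.
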
> >>>>>
> >>>>> Back to Ignite: it looks to me that we do not design for operations as
> >>>>> described above. We have no event IDs: our log messages are subject to
> >>>>> change in new versions, so any patterns DevOps might use to detect
> >>>>> significant events would stop working after an upgrade.
> >>>>>
> >>>>> If I am not the only one who has such concerns, then we might open a
> >>>>> ticket to address this.
> >>>>>
> >>>>>
> >>>>> Best regards, Alexey
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Alexey Kuznetsov
> >>
> >>
>
>
