cassandra-commits mailing list archives

From "Ariel Weisberg (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CASSANDRA-10245) Provide after the fact visibility into the reliability of the environment C* operates in
Date Tue, 01 Sep 2015 21:30:46 GMT
Ariel Weisberg created CASSANDRA-10245:
------------------------------------------

             Summary: Provide after the fact visibility into the reliability of the environment C* operates in
                 Key: CASSANDRA-10245
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10245
             Project: Cassandra
          Issue Type: New Feature
          Components: Core
            Reporter: Ariel Weisberg
             Fix For: 3.x


I think that, by default, databases should not be completely dependent on operator-provided
tools for monitoring node and network health.

The database should be able to detect and report on several dimensions of performance in its
environment, and more specifically report on deviations from acceptable performance.

* Node-wide pauses
* JVM-wide pauses
* Latency and round-trip time to all endpoints
* Block device IO latency

If Flight Recorder were available for use in production, I would say as a start just turn that
on, add jHiccup (inside and outside the server process), and add a daemon inside the server to
measure network performance between endpoints.
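
To make the jHiccup idea concrete, here is a minimal sketch of that style of pause detection: a thread asks to sleep for a fixed interval and treats any overshoot as time during which it could not run. The class name PauseDetector, its methods, and the interval and threshold parameters are all hypothetical; this is not jHiccup's code or an existing Cassandra API, only an illustration under those assumptions.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

/**
 * Hypothetical jHiccup-style pause detector (illustrative sketch only).
 * A dedicated thread parks for a fixed interval; waking up late means
 * the thread was stalled by the JVM, the OS, or the hardware.
 */
public class PauseDetector implements Runnable {
    private final long intervalNanos;   // how long each sleep should take
    private final long thresholdNanos;  // overshoot that counts as a pause
    private final AtomicLong maxPauseNanos = new AtomicLong();
    private final AtomicLong pausesOverThreshold = new AtomicLong();
    private volatile boolean running = true;

    public PauseDetector(long intervalMillis, long thresholdMillis) {
        this.intervalNanos = TimeUnit.MILLISECONDS.toNanos(intervalMillis);
        this.thresholdNanos = TimeUnit.MILLISECONDS.toNanos(thresholdMillis);
    }

    /** Time spent beyond the requested interval; clamped so early wakeups count as zero. */
    public static long overshootNanos(long startNanos, long endNanos, long intervalNanos) {
        return Math.max(0L, (endNanos - startNanos) - intervalNanos);
    }

    /** Records one observation: tracks the worst pause and counts threshold violations. */
    public void record(long pauseNanos) {
        maxPauseNanos.accumulateAndGet(pauseNanos, Math::max);
        if (pauseNanos >= thresholdNanos) {
            pausesOverThreshold.incrementAndGet();
        }
    }

    @Override
    public void run() {
        while (running) {
            long start = System.nanoTime();
            LockSupport.parkNanos(intervalNanos);
            record(overshootNanos(start, System.nanoTime(), intervalNanos));
        }
    }

    public void stop() { running = false; }
    public long maxPauseNanos() { return maxPauseNanos.get(); }
    public long pausesOverThreshold() { return pausesOverThreshold.get(); }
}
```

Running one instance on a daemon thread inside the server and another in a separate small process is what would let the two be compared: a stall seen by both suggests a node-wide pause, while one seen only inside the server points at the JVM. The counters here would feed background logging rather than alerts.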

FR is not available (it requires a license in production), so instead we should focus on adding
instrumentation for the facets of Flight Recorder that are most useful in diagnosing performance
issues. I think we can get pretty far because what we need to do is not quite as undirected as
the exploration FR and JMC facilitate.

Until we dial in how we measure and how to signal without false positives, I would expect this
kind of logging to be in the background for post-hoc analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
