zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
Date Mon, 30 Apr 2018 04:43:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458303#comment-16458303
] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r184910721
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
    @@ -148,9 +174,46 @@ synchronized public void resetRequestCounters(){
             packetsReceived = 0;
             packetsSent = 0;
         }
    +    synchronized public void resetNumRequestsAboveThresholdTime() {
    +        numRequestsAboveThresholdTime = 0;
    +    }
         synchronized public void reset() {
             resetLatency();
             resetRequestCounters();
    +        resetNumRequestsAboveThresholdTime();
    +    }
    +
    +    public void checkLatency(final ZooKeeperServer zks, Request request) {
    +        long requestLatency = Time.currentElapsedTime() - request.createTime;
    +        boolean enabledAndAboveThreshold = (requestWarnThresholdMs == 0) ||
    +                (requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs);
    +        if (enabledAndAboveThreshold) {
    +            zks.serverStats().incNumRequestsAboveThresholdTime();
    +
    +            // Try acquiring lock only if not waiting
    +            boolean success = waitForLoggingWarnThresholdMsg.compareAndSet(Boolean.FALSE,
Boolean.TRUE);
    +            if (success) {
    +                LOG.warn("Request {} exceeded threshold. Took {} ms", request, requestLatency);
    +                startCount = zks.serverStats().getNumRequestsAboveThresholdTime();
    +                timer.schedule(new TimerTask() {
    +                    @Override
    +                    public void run() {
    +                        long count = zks.serverStats().getNumRequestsAboveThresholdTime()
- startCount;
    --- End diff --
    
    If there is no slow requests coming in between the time the task is scheduled and the
time the task is executed, this count will be 0. It will not reflect the actual number of
requests we want to log. Maybe log the startCount instead and reset it and leave the task
only to reset the barrier?
    
    Also please use ScheduledService instead of Timer task.


> ZooKeeper slow operation log
> ----------------------------
>
>                 Key: ZOOKEEPER-2770
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
>             Project: ZooKeeper
>          Issue Type: Improvement
>            Reporter: Karan Mehta
>            Assignee: Karan Mehta
>            Priority: Major
>         Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why any given
read or write operation may become slow: a software bug, a protocol problem, a hardware issue
with the commit log(s), a network issue. If the problem is constant it is trivial to come
to an understanding of the cause. However in order to diagnose intermittent problems we often
don't know where, or when, to begin looking. We need some sort of timestamped indication of
the problem. Although ZooKeeper is not a datastore, it does persist data, and can suffer intermittent
performance degradation, and should consider implementing a 'slow query' log, a feature very
common to services which persist information on behalf of clients which may be sensitive to
latency while waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally processing the
request, that the current time minus arrival time of the request is beyond a configured threshold.

> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message