hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Xiaoyu Yao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12325) RPC Metrics : Add the ability track and log slow RPCs
Date Tue, 18 Aug 2015 19:21:51 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701827#comment-14701827
] 

Xiaoyu Yao commented on HADOOP-12325:
-------------------------------------

Thanks for working on this [~anu]. Below are some early comments:

1. Do you miss updating all the caller of  ProtobufRpcEngine.call() to pass receiveTime using
Time.monotonicNow() instead of the Time.now()? 

For example, the constructor of Call() is still use Time.now() and some callers ProtobufRpcEngine.call(Server,
String, Writable, long)  (org.apache.hadoop.ipc) also passes receiveTime returned from Time.now().


{code}
org.apache.hadoop.ipc.Server.java
private boolean processResponse(LinkedList<Call> responseQueue,
                                    boolean inHandler) throws IOException {
              // set the serve time when the response has to be sent later
              call.timestamp = Time.now();
{code}

Compare Time.monotonicNow() with Time.now() will produce unpredictable results. Searching
through the code, I found that changing all the Call#timestamp from Time.Now() to Time.MonotonicNow
is a broader change than the original purpose of this JIRA. It is probably better to just
keep the current usage of Time.now() AS-IS for this ticket and use a separate JIRA to track
changing Time.now() to Time.monotonic() for RPC call layer usage.

2. Do we need update WritableRpcEngine.java class with logSlowRpcCalls()?

3. NIT: Can you put the magic number 1024 as final variable like 
"final int minSampleSize = 1024" and 
move some of the comments to javadoc of the function: "Also we start this
682	   * logic only after we have enough sample size."
{code}
// 1024 is just a guess -- not a number computed based on
691	    // sample size analysis. It is chosen with the hope that this
692	    // number is high enough to avoid spurious logging, yet useful
693	    // in practice.
{code}

4. Can you change the following from
{code}
String message = String.format(
697	          "Slow RPC : %s took %d milliseconds to process. Rpc Request" +
698	              "  :  %s",
699	          request.getMethodName(), processingTime, request.toString());
700	      LOG.warn(message);
{code}

to the following for better logging performance. 
{code}
LOG.warn("Slow RPC : {} took {} milliseconds to process. Rpc Request" +
698	              "  :  {}",request.getMethodName(), processingTime, request.toString());
{code}

5. NIT: Rpc -> RPC to be consistent
{code}
73	   * @param server - RPC Server
	   * @param request - Rpc Request
{code}

6. Can you make the SleepRequestProto accepting a duration parameter instead of the fixed
SLEEP_DURATION (1000ms)? That will make it more useful for future tests without additional
code changes.

7. Is it possible to test with 1K fast calls instead of 10K calls to save test resources without
affecting the results?

> RPC Metrics : Add the ability track and log slow RPCs
> -----------------------------------------------------
>
>                 Key: HADOOP-12325
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12325
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc, metrics
>    Affects Versions: 2.7.1
>            Reporter: Anu Engineer
>            Assignee: Anu Engineer
>         Attachments: HADOOP-12325.001.patch
>
>
> This JIRA proposes to add a counter called RpcSlowCalls and also a configuration setting
that allows users to log really slow RPCs.  Slow RPCs are RPCs that fall at 99th percentile.
This is useful to troubleshoot why certain services like name node freezes under heavy load.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message