Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Wed, 14 Sep 2016 09:29:21 +0000 (UTC)
From: "Guanghao Zhang (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.12986121.1467366834000.569691.1473845361500@Atlassian.JIRA>
In-Reply-To: <JIRA.12986121.1467366834000@Atlassian.JIRA>
References: <JIRA.12986121.1467366834000@Atlassian.JIRA> <JIRA.12986121.1467366834036@arcas>
Subject: [jira] [Commented] (HBASE-16165) Decrease RpcServer.callQueueSize
 before writeResponse causes OOM
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Wed, 14 Sep 2016 09:29:23 -0000


    [ https://issues.apache.org/jira/browse/HBASE-16165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15489936#comment-15489936 ] 

Guanghao Zhang commented on HBASE-16165:
----------------------------------------

We observed an OOM in our production cluster. Table A has 500+ regions in the source cluster but only 1 region in the slave cluster. An MR job then wrote a large amount of data to the source cluster. The data was replicated to the slave cluster, where all of it was written to a single regionserver, and that regionserver then crashed with an OOM. One fix is to decrease RpcServer.callQueueSize only when the responder has actually written out the response. Another fix is to nullify the param early. I uploaded a small fix that takes the second approach and sets the param to null when sending the response.
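
To make the two fixes concrete, here is a minimal single-threaded sketch of the accounting. It is not the HBase code and not the attached patch: the class, the Call and Server types, and the offer/run/writeOneResponse methods are hypothetical names chosen for illustration, and only callQueueSizeInBytes and maxQueueSizeInBytes echo names from the issue. It assumes the counter is charged when a call is admitted and released only once the response has actually been written, and that the request param is dropped as soon as the handler has finished.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch of call-queue byte accounting; not the HBase RpcServer API. */
public class CallQueueAccountingSketch {

    /** A call holding a request payload (param) and, once handled, a response. */
    static final class Call {
        byte[] param;     // request payload, charged against the queue limit
        byte[] response;  // set once the handler has run
        final long size;  // bytes charged to the counter for this call

        Call(byte[] param) {
            this.param = param;
            this.size = param.length;
        }
    }

    /** Not thread-safe: a real server would need atomic counters and concurrent queues. */
    static final class Server {
        final long maxQueueSizeInBytes;
        long callQueueSizeInBytes = 0;
        final Queue<Call> pendingResponses = new ArrayDeque<>();

        Server(long maxQueueSizeInBytes) {
            this.maxQueueSizeInBytes = maxQueueSizeInBytes;
        }

        /** Admit a call only if it still fits under the byte limit. */
        boolean offer(Call call) {
            if (callQueueSizeInBytes + call.size > maxQueueSizeInBytes) {
                return false;
            }
            callQueueSizeInBytes += call.size;
            return true;
        }

        /** Handler step: build the response and drop the request payload early. */
        void run(Call call) {
            call.response = new byte[] {1};  // placeholder response
            call.param = null;               // "nullify the param early"
            pendingResponses.add(call);
            // The counter is deliberately NOT decreased here.
        }

        /** Responder step: write one response, then release its bytes. */
        boolean writeOneResponse() {
            Call call = pendingResponses.poll();
            if (call == null) {
                return false;
            }
            // A real server would write call.response to the socket at this point.
            call.response = null;
            callQueueSizeInBytes -= call.size;  // decrease only after the write
            return true;
        }
    }

    public static void main(String[] args) {
        Server server = new Server(100);
        Call first = new Call(new byte[60]);
        System.out.println("first admitted: " + server.offer(first));
        server.run(first);
        // The response is still pending, so its bytes still count against the limit.
        System.out.println("second admitted: " + server.offer(new Call(new byte[60])));
        server.writeOneResponse();
        System.out.println("third admitted: " + server.offer(new Call(new byte[60])));
    }
}
```

In this sketch a slow client keeps the counter high, so new calls are rejected instead of piling up in the heap; nulling the param alone does not change the counter, it only lets the request payload be collected sooner.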

> Decrease RpcServer.callQueueSize before writeResponse causes OOM
> ----------------------------------------------------------------
>
>                 Key: HBASE-16165
>                 URL: https://issues.apache.org/jira/browse/HBASE-16165
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Priority: Minor
>         Attachments: HBASE-16165.patch
>
>
> In RpcServer, we use {{callQueueSizeInBytes}} to avoid queuing too many calls, which would cause an OOM. But in {{CallRunner.run}}, we decrease it before sending the response back. And even after calling {{sendResponseIfReady}}, the call object can stay in the heap for a long time if we cannot write out the response (that's why we need a Responder thread...). This makes it possible for the actual size of all call objects in the heap to exceed {{maxQueueSizeInBytes}}, which causes an OOM.

