Date: Wed, 14 Sep 2016 09:29:21 +0000 (UTC)
From: "Guanghao Zhang (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-16165) Decrease RpcServer.callQueueSize before writeResponse causes OOM

[ https://issues.apache.org/jira/browse/HBASE-16165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15489936#comment-15489936 ]

Guanghao Zhang commented on HBASE-16165:
----------------------------------------

We observed an OOM case in our production cluster.
Table A has 500+ regions in the source cluster but only 1 region in the slave cluster. An MR job then wrote a large amount of data to the source cluster. The data was replicated to the slave cluster, where all of the writes went to a single regionserver, and that regionserver crashed with an OOM. One fix is to decrease RpcServer.callQueueSize only when the responder has actually written out the response. Another fix is to nullify the param early. I have uploaded a small fix for this that sets the param to null when the response is sent.

> Decrease RpcServer.callQueueSize before writeResponse causes OOM
> ----------------------------------------------------------------
>
>                 Key: HBASE-16165
>                 URL: https://issues.apache.org/jira/browse/HBASE-16165
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Priority: Minor
>         Attachments: HBASE-16165.patch
>
>
> In RpcServer, we use {{callQueueSizeInBytes}} to avoid queuing too many calls, which would cause OOM. But in {{CallRunner.run}}, we decrease it before sending the response back. And even after calling {{sendResponseIfReady}}, the call object could stay in our heap for a long time if we cannot write out the response (that's why we need a Responder thread...). This makes it possible for the actual size of all call objects in the heap to be larger than {{maxQueueSizeInBytes}}, which causes OOM.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
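
To make the two fixes discussed in the comment above concrete, here is a minimal Java sketch of byte accounting for queued calls. It is not the actual HBase RpcServer code and not the contents of HBASE-16165.patch. The class CallQueueAccounting and the methods tryEnqueue, finishProcessing, responseWritten, hasParam and queuedBytes are hypothetical names chosen for illustration; only the field names callQueueSizeInBytes and maxQueueSizeInBytes are taken from the issue description. The sketch assumes a call is rejected when admitting it would push the counter past the limit.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of call-queue byte accounting; not the actual HBase RpcServer code.
class CallQueueAccounting {
    private final long maxQueueSizeInBytes;
    private final AtomicLong callQueueSizeInBytes = new AtomicLong();

    CallQueueAccounting(long maxQueueSizeInBytes) {
        this.maxQueueSizeInBytes = maxQueueSizeInBytes;
    }

    // One in-flight call. The request param is the large object held on the heap.
    final class Call {
        private byte[] param;
        private final long size;
        private boolean released;

        private Call(byte[] param) {
            this.param = param;
            this.size = param.length;
        }

        // "Nullify the param early": drop the request payload once the handler
        // is done with it, so it can be collected while the response is pending.
        void finishProcessing() {
            param = null;
        }

        // "Decrease when the responder really writes out the response":
        // return the bytes to the budget only at this point, and only once.
        synchronized void responseWritten() {
            if (!released) {
                released = true;
                callQueueSizeInBytes.addAndGet(-size);
            }
        }

        boolean hasParam() {
            return param != null;
        }
    }

    // Admits the call and charges its size, or returns null when the limit
    // would be exceeded (assumed rejection policy).
    Call tryEnqueue(byte[] param) {
        long size = param.length;
        while (true) {
            long current = callQueueSizeInBytes.get();
            if (current + size > maxQueueSizeInBytes) {
                return null;
            }
            if (callQueueSizeInBytes.compareAndSet(current, current + size)) {
                return new Call(param);
            }
        }
    }

    long queuedBytes() {
        return callQueueSizeInBytes.get();
    }

    public static void main(String[] args) {
        CallQueueAccounting queue = new CallQueueAccounting(100);
        Call slow = queue.tryEnqueue(new byte[80]);
        slow.finishProcessing();   // handler finished, response not yet written
        System.out.println(queue.tryEnqueue(new byte[80]) == null);
        slow.responseWritten();    // responder finally wrote the response
        System.out.println(queue.tryEnqueue(new byte[80]) != null);
    }
}
```

In this sketch a call whose response is still pending keeps counting against the limit, so a slow client cannot let the heap fill with call objects that the counter no longer sees. Nullifying the param attacks the same problem from the other side: the counter may be released early, but the largest part of the call object is no longer reachable.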