From: "Yu Li (JIRA)"
To: issues@hbase.apache.org
Date: Thu, 7 Jul 2016 16:52:10 +0000 (UTC)
Subject: [jira] [Updated] (HBASE-16193) Memory leak when putting plenty of duplicated cells

[ https://issues.apache.org/jira/browse/HBASE-16193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Li updated HBASE-16193:
--------------------------
    Attachment: MemoryLeakInMemStore_2.png
                MemoryLeakInMemStore.png

The attached MAT heap-dump screenshots show one MemStore taking ~6 GB of memory with 2,600 chunks in it, which is a
strong indication of the first problem.

> Memory leak when putting plenty of duplicated cells
> ---------------------------------------------------
>
>                 Key: HBASE-16193
>                 URL: https://issues.apache.org/jira/browse/HBASE-16193
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Yu Li
>            Assignee: Yu Li
>         Attachments: MemoryLeakInMemStore.png, MemoryLeakInMemStore_2.png
>
> Recently we hit a weird problem: the RS heap size would not shrink much even after a full GC, and the server kept doing full GCs and could hardly serve any requests. After days of debugging we found the root cause: memory allocated in an MSLAB chunk is not counted when a duplicated cell (put or delete) is added. {{AbstractMemStore#add}} ({{DefaultMemStore#add}} on branch-1) does the following:
> {code}
> public long add(Cell cell) {
>   Cell toAdd = maybeCloneWithAllocator(cell);
>   return internalAdd(toAdd);
> }
> {code}
> Memory for the cell is first allocated in an MSLAB chunk (if MSLAB is in use), and then {{internalAdd}} is called. In {{Segment#internalAdd}} ({{DefaultMemStore#internalAdd}} on branch-1) we have:
> {code}
> protected long internalAdd(Cell cell) {
>   boolean succ = getCellSet().add(cell);
>   long s = AbstractMemStore.heapSizeChange(cell, succ);
>   updateMetaInfo(cell, s);
>   return s;
> }
> {code}
> So when a duplicated cell is written, we record no heap-size change, while chunk space is actually taken (referenced).
> Now consider a workload with a huge number of writes to the same cell (same key, different values) -- not unusual in machine-learning use cases -- plus a few normal writes. After a long enough time, we may end up with many chunks holding cells like {{cellA, cellB, cellA, cellA, .... cellA}}: we count only 2 cells per chunk, but each chunk is actually full. Hence the devil: we believe the flush size has not been reached, while GBs of heap are already taken.
> There is also a more extreme case: writing a single cell over and over again fills a chunk quickly. Ideally the chunk would be reclaimed by GC, but we keep a redundant reference to it in {{HeapMemStoreLAB#chunkQueue}}, which serves no purpose when the chunk pool is not in use (the default).
> This is the umbrella issue describing the problem; I will open two sub-JIRAs to resolve the two issues separately.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
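[Editor's sketch] The first problem in the quoted description can be modeled in a few lines. The following is a hypothetical, deliberately simplified Java model -- {{MemStoreAccountingModel}}, {{CELL_SIZE}}, and the string-keyed cell set are illustrative stand-ins, not HBase's real classes -- showing how chunk allocation grows on every add while the accounted heap size only grows when the cell set actually accepts the cell:

```java
import java.util.TreeSet;

// Simplified model of the accounting gap: every add() copies the cell into
// MSLAB chunk space, but heap-size accounting only changes when the cell
// set accepts the cell (i.e. the key is new).
public class MemStoreAccountingModel {
    private static final long CELL_SIZE = 100; // pretend each cell occupies 100 bytes in a chunk

    private final TreeSet<String> cellSet = new TreeSet<>(); // stands in for the memstore's cell set
    long accountedHeapSize = 0;   // what flush decisions are based on
    long mslabBytesAllocated = 0; // chunk memory actually referenced

    // Mirrors add() -> maybeCloneWithAllocator() -> internalAdd() from the description.
    long add(String cellKey) {
        mslabBytesAllocated += CELL_SIZE;    // the clone into a chunk happens unconditionally
        boolean succ = cellSet.add(cellKey); // duplicate keys are rejected by the set
        long s = succ ? CELL_SIZE : 0;       // heapSizeChange(cell, succ): 0 for duplicates
        accountedHeapSize += s;
        return s;
    }

    public static void main(String[] args) {
        MemStoreAccountingModel m = new MemStoreAccountingModel();
        for (int i = 0; i < 1000; i++) {
            m.add("rowA"); // same cell over and over
        }
        // Accounted size stays at one cell while chunk usage grows unbounded:
        // prints accounted=100 allocated=100000
        System.out.println("accounted=" + m.accountedHeapSize
                + " allocated=" + m.mslabBytesAllocated);
    }
}
```

With 1,000 writes of the same key, the model's flush trigger sees 100 bytes while 100 KB of chunk memory is referenced -- the same shape as the GBs-of-untracked-heap scenario above, just scaled down.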
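[Editor's sketch] The second problem -- the redundant chunk reference -- can be modeled the same way. This is again a hypothetical sketch ({{ChunkRetentionModel}} and its fields are illustrative, not HBase's real code): a bookkeeping queue keeps every chunk strongly reachable, so even after the memstore drops its own reference, GC cannot reclaim any of them.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model: chunks pushed onto a bookkeeping queue stay strongly
// reachable, so dropping the memstore's own reference frees nothing.
public class ChunkRetentionModel {
    static final int CHUNK_SIZE = 2 * 1024 * 1024; // MSLAB's default chunk size is 2 MB

    final Deque<byte[]> chunkQueue = new ArrayDeque<>(); // stands in for the redundant queue
    byte[] currentChunk;

    void allocateChunk() {
        currentChunk = new byte[CHUNK_SIZE];
        chunkQueue.add(currentChunk); // retained even though no chunk pool will ever reuse it
    }

    long retainedBytes() {
        long total = 0;
        for (byte[] chunk : chunkQueue) {
            total += chunk.length;
        }
        return total;
    }

    public static void main(String[] args) {
        ChunkRetentionModel lab = new ChunkRetentionModel();
        for (int i = 0; i < 100; i++) {
            lab.allocateChunk();
        }
        lab.currentChunk = null; // the memstore is done with the chunks...
        // ...but the queue still pins 100 * 2 MB = 200 MB of heap.
        System.out.println("retained MB: " + lab.retainedBytes() / (1024 * 1024));
    }
}
```

The fix direction implied by the description is simply to not enqueue chunks when no pool is in use, so each retired chunk becomes unreachable and eligible for collection.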