Date: Wed, 27 Jan 2016 22:25:39 +0000 (UTC)
From: "Hudson (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-15171) Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer

    [ https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15120292#comment-15120292 ]

Hudson commented on HBASE-15171:
--------------------------------

FAILURE: Integrated in HBase-1.3 #517 (See [https://builds.apache.org/job/HBase-1.3/517/])
HBASE-15171 Avoid counting duplicate kv and generating lots of small (tedyu: rev 630ad95c923f642d006274b9b1a14397a6713412)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java


> Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-15171
>                 URL: https://issues.apache.org/jira/browse/HBASE-15171
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.0.0, 1.1.2, 0.98.17
>            Reporter: Yu Li
>            Assignee: Yu Li
>             Fix For: 2.0.0, 1.3.0
>
>         Attachments: HBASE-15171.patch, HBASE-15171.patch, HBASE-15171.patch
>
>
> One of our online users once wrote a huge number of duplicated kvs during bulkload, and we found that this generated lots of small hfiles and slowed down the whole process.
> After debugging, we found that PutSortReducer#reduce already tries to handle the pathological case by setting a threshold for single-row size and using a TreeMap to avoid writing out duplicated kvs, but it forgets to exclude duplicated kvs from the accumulated size, as shown in the code segment below:
> {code}
> while (iter.hasNext() && curSize < threshold) {
>   Put p = iter.next();
>   for (List<Cell> cells: p.getFamilyCellMap().values()) {
>     for (Cell cell: cells) {
>       KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
>       map.add(kv);
>       curSize += kv.heapSize();
>     }
>   }
> }
> {code}
> We should move the {{curSize += kv.heapSize();}} line out of the outer for loop.
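
To make the effect described in the issue concrete, here is a minimal, HBase-free sketch of the accounting problem. It is not the committed patch: the class name DedupSizeSketch, the method countFlushes, and the use of String length as a stand-in for kv.heapSize() are all assumptions made for illustration. It also uses a different remedy from the one proposed in the description, namely counting a cell's size only when the sorted set reports that the cell was actually added.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

public class DedupSizeSketch {
    // Hypothetical stand-in for kv.heapSize(): the string length.
    static long sizeOf(String cell) {
        return cell.length();
    }

    /**
     * Drains the cells in batches, ending a batch once curSize reaches the
     * threshold, and returns the number of batches (one "flush" per batch).
     * With skipDuplicates == false every cell is counted, as in the quoted loop;
     * with skipDuplicates == true only cells new to the batch are counted.
     */
    static int countFlushes(List<String> cells, long threshold, boolean skipDuplicates) {
        int flushes = 0;
        Iterator<String> iter = cells.iterator();
        while (iter.hasNext()) {
            TreeSet<String> batch = new TreeSet<>();
            long curSize = 0;
            while (iter.hasNext() && curSize < threshold) {
                String cell = iter.next();
                // TreeSet.add returns false if the element is already present.
                boolean added = batch.add(cell);
                if (added || !skipDuplicates) {
                    curSize += sizeOf(cell);
                }
            }
            flushes++;
        }
        return flushes;
    }

    public static void main(String[] args) {
        // 100 copies of the same 11-character cell, threshold of 50.
        List<String> cells = Collections.nCopies(100, "row1/cf:q/v");
        System.out.println("counting duplicates: " + countFlushes(cells, 50, false));
        System.out.println("skipping duplicates: " + countFlushes(cells, 50, true));
    }
}
```

With heavily duplicated input, the first mode hits the threshold after a handful of cells even though the batch holds a single distinct entry, which is the "lots of small hfiles" symptom. The second mode keeps reading until the batch's distinct content reaches the threshold.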