hbase-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-15171) Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer
Date Thu, 28 Jan 2016 19:49:39 GMT

    [ https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122196#comment-15122196 ]

Hudson commented on HBASE-15171:
--------------------------------

FAILURE: Integrated in HBase-Trunk_matrix #665 (See [https://builds.apache.org/job/HBase-Trunk_matrix/665/])
HBASE-15171 Addendum removes extra loop (Yu Li) (tedyu: rev 37ed0f6d0815389e0b368bc98b3a01dd02f193ac)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java


> Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-15171
>                 URL: https://issues.apache.org/jira/browse/HBASE-15171
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.0.0, 1.1.2, 0.98.17
>            Reporter: Yu Li
>            Assignee: Yu Li
>             Fix For: 2.0.0, 1.3.0
>
>         Attachments: HBASE-15171.addendum.patch, HBASE-15171.patch, HBASE-15171.patch, HBASE-15171.patch
>
>
> One of our online users once wrote a huge number of duplicated kvs during bulkload, and we found that this generated lots of small hfiles and slowed down the whole process.
> After debugging, we found that PutSortReducer#reduce already tries to handle the pathological case by setting a threshold for single-row size and using a TreeMap to avoid writing out duplicated kvs, but it forgets to exclude duplicated kvs from the accumulated size, as shown in the code segment below:
> {code}
> while (iter.hasNext() && curSize < threshold) {
>   Put p = iter.next();
>   for (List<Cell> cells: p.getFamilyCellMap().values()) {
>     for (Cell cell: cells) {
>       KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
>       map.add(kv);
>       curSize += kv.heapSize();
>     }
>   }
> }
> {code}
> We should move the {{curSize += kv.heapSize();}} line out of the outer for loop.
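
For illustration only, here is a minimal standalone sketch of one way to keep duplicates out of the accumulated size. It is not the committed patch. It assumes the collection behaves like a java.util.TreeSet, whose add() returns false when an equal element is already present (the quoted snippet calls map.add(kv)). The class name DedupSizeSketch, the method fillUntilThreshold, and sizeOf (a stand-in for kv.heapSize()) are hypothetical, and plain strings replace KeyValue so the example runs without HBase on the classpath.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

public class DedupSizeSketch {
    // Hypothetical stand-in for kv.heapSize(): here, just the string length.
    static long sizeOf(String kv) {
        return kv.length();
    }

    // Drains batches from iter into set until the accumulated size of
    // distinct entries reaches threshold; returns that accumulated size.
    static long fillUntilThreshold(Iterator<List<String>> iter,
                                   TreeSet<String> set,
                                   long threshold) {
        long curSize = 0;
        while (iter.hasNext() && curSize < threshold) {
            for (String kv : iter.next()) {
                // TreeSet.add returns false for an element that is already
                // present, so duplicates do not inflate curSize.
                if (set.add(kv)) {
                    curSize += sizeOf(kv);
                }
            }
        }
        return curSize;
    }

    public static void main(String[] args) {
        // Three batches for the same row; the second repeats the first.
        List<List<String>> puts = Arrays.asList(
            Arrays.asList("row1/cf:a", "row1/cf:b"),
            Arrays.asList("row1/cf:a", "row1/cf:b"),
            Arrays.asList("row1/cf:c"));
        TreeSet<String> set = new TreeSet<>();
        long size = fillUntilThreshold(puts.iterator(), set, 1000L);
        System.out.println(set.size() + " distinct entries, size " + size);
    }
}
```

With the unconditional increment from the quoted snippet, the repeated batch would be counted twice, so the threshold would be reached early even though the set holds no new entries.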



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
