hbase-dev mailing list archives

From Stack <st...@duboce.net>
Subject Re: Bulkload discards duplicates
Date Mon, 12 Mar 2012 15:20:22 GMT
On Mon, Mar 12, 2012 at 8:17 AM, Laxman <lakshman.ch@huawei.com> wrote:
> In our test, we noticed that bulkload is discarding duplicates.
> On further analysis, I noticed that duplicates are discarded only when
> they exist in the same input file and in the same split.
> I think this is a bug and not intentional behavior.
>
> The use of TreeSet in the code snippet below is causing the issue: the
> set silently drops KeyValues that compare as equal under KeyValue.COMPARATOR.
>
> PutSortReducer.reduce()
> ======================
>      TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
>      long curSize = 0;
>      // stop at the end or the RAM threshold
>      while (iter.hasNext() && curSize < threshold) {
>        Put p = iter.next();
>        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
>          for (KeyValue kv : kvs) {
>            map.add(kv); // a KeyValue equal to one already in the set is silently dropped
>            curSize += kv.getLength();
>          }
>        }
>
> Changing this back to a List and then sorting explicitly will solve the
> issue (see the sketch below).
>
> Filed a new JIRA for this
> https://issues.apache.org/jira/browse/HBASE-5564
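For reference, a minimal sketch of the List-plus-explicit-sort approach
described above. This is not the patch attached to HBASE-5564; the class and
method names are illustrative, and the 0.92-era Put/KeyValue client APIs are
assumed.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Put;

    public class ListSortSketch {
      // Collect every KeyValue from the incoming Puts up to the RAM
      // threshold, then sort explicitly instead of relying on a TreeSet.
      static List<KeyValue> collectAndSort(Iterator<Put> iter, long threshold) {
        List<KeyValue> kvs = new ArrayList<KeyValue>();
        long curSize = 0;
        // stop at the end or the RAM threshold, as in PutSortReducer
        while (iter.hasNext() && curSize < threshold) {
          Put p = iter.next();
          for (List<KeyValue> familyKvs : p.getFamilyMap().values()) {
            for (KeyValue kv : familyKvs) {
              kvs.add(kv);               // equal KeyValues are retained
              curSize += kv.getLength();
            }
          }
        }
        // explicit sort restores the ordering the TreeSet used to provide
        Collections.sort(kvs, KeyValue.COMPARATOR);
        return kvs;
      }
    }

Unlike the TreeSet, the ArrayList keeps KeyValues that compare as equal, so
duplicate cells from the same input file and split survive into the written
HFile.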

Thank you for finding the issue and making a JIRA.
St.Ack
