hbase-user mailing list archives

From Laxman <lakshman...@huawei.com>
Subject Bulkload discards duplicates
Date Mon, 12 Mar 2012 15:17:34 GMT
In our test, we noticed that bulkload is discarding duplicates.
On further analysis, I found that duplicates are discarded only when
they exist in the same input file and in the same split.
I think this is a bug and not intentional behavior.

Usage of TreeSet in the below code snippet is causing the issue.

      TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
      long curSize = 0;
      // stop at the end or the RAM threshold
      while (iter.hasNext() && curSize < threshold) {
        Put p = iter.next();
        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
          for (KeyValue kv : kvs) {
            curSize += kv.getLength();
            // a KeyValue that compares equal to one already in the set is dropped here
            map.add(kv);
          }
        }
      }

Changing this back to a List and then sorting it explicitly will solve the issue.
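
To illustrate the difference, here is a minimal standalone sketch. It does not use the HBase classes: Cell is a hypothetical stand-in for KeyValue, and KEY_ORDER is an assumed comparator that, like a key-based comparator, ignores the value. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class DuplicateCellDemo {

    /** Hypothetical stand-in for a KeyValue: a row key, a qualifier and a value. */
    static final class Cell {
        final String row;
        final String qualifier;
        final String value;

        Cell(String row, String qualifier, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    /** Assumed ordering for illustration: compares the key parts only, never the value. */
    static final Comparator<Cell> KEY_ORDER =
            Comparator.comparing((Cell c) -> c.row).thenComparing(c -> c.qualifier);

    /** TreeSet approach: a cell that compares equal to an earlier one is not added. */
    static List<Cell> collectWithTreeSet(List<Cell> input) {
        TreeSet<Cell> set = new TreeSet<>(KEY_ORDER);
        set.addAll(input);
        return new ArrayList<>(set);
    }

    /** List approach: keep every cell, then sort explicitly (List.sort is stable). */
    static List<Cell> collectWithSortedList(List<Cell> input) {
        List<Cell> list = new ArrayList<>(input);
        list.sort(KEY_ORDER);
        return list;
    }

    public static void main(String[] args) {
        List<Cell> input = Arrays.asList(
                new Cell("row2", "q", "a"),
                new Cell("row1", "q", "b"),
                new Cell("row1", "q", "c")); // same key as the previous cell

        System.out.println(collectWithTreeSet(input).size());    // prints 2
        System.out.println(collectWithSortedList(input).size()); // prints 3
    }
}
```

With the TreeSet, the second "row1" cell is dropped because the comparator reports it as equal to the first. With the sorted list, both cells survive and come out in key order.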

Filed a new JIRA for this
