hbase-user mailing list archives

From lars hofhansl <lhofha...@yahoo.com>
Subject Re: Bulkload discards duplicates
Date Mon, 12 Mar 2012 16:41:35 GMT
Hi Laxman,

can you clarify what you mean by "duplicates"?
The TreeSet is using KeyValue.COMPARATOR, which treats KVs as the same only if the entire key
(including column and timestamp) is the same.
Do you have KVs with the same rowKey, columnKey, and timestamp, but different values?
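
To illustrate the point about the comparator, here is a minimal, self-contained sketch. It does not use HBase: Cell and KEY_ONLY are hypothetical stand-ins for KeyValue and KeyValue.COMPARATOR, and the ordering is simplified. It only shows that a TreeSet drops an element whose key compares equal to one already present, even when the values differ.

```java
import java.util.Comparator;
import java.util.TreeSet;

public class TreeSetDedupDemo {
    // Hypothetical stand-in for an HBase KeyValue: key parts plus a value.
    static final class Cell {
        final String row;
        final String column;
        final long timestamp;
        final String value;

        Cell(String row, String column, long timestamp, String value) {
            this.row = row;
            this.column = column;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Assumption: compares the key only (row, column, timestamp) and ignores
    // the value. This is a simplification, not the real KeyValue.COMPARATOR.
    static final Comparator<Cell> KEY_ONLY =
        Comparator.comparing((Cell c) -> c.row)
                  .thenComparing(c -> c.column)
                  .thenComparingLong(c -> c.timestamp);

    static int distinctCount(Cell... cells) {
        TreeSet<Cell> set = new TreeSet<>(KEY_ONLY);
        for (Cell c : cells) {
            // add() returns false and keeps the existing element
            // when the comparator reports the two as equal.
            set.add(c);
        }
        return set.size();
    }

    public static void main(String[] args) {
        Cell a = new Cell("r1", "cf:q", 100L, "v1");
        Cell b = new Cell("r1", "cf:q", 100L, "v2"); // same key, different value
        Cell c = new Cell("r1", "cf:q", 101L, "v3"); // different timestamp
        System.out.println(distinctCount(a, b, c)); // prints 2
    }
}
```

In this sketch the cell with value "v2" is silently dropped because its key matches the first cell; the cell with the different timestamp is kept.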


-- Lars

 From: Laxman <lakshman.ch@huawei.com>
To: dev@hbase.apache.org; user@hbase.apache.org 
Sent: Monday, March 12, 2012 8:17 AM
Subject: Bulkload discards duplicates
In our test, we noticed that bulkload is discarding the duplicates.
On further analysis, I noticed that duplicates are discarded only when
they exist in the same input file and in the same split.
I think this is a bug and not intentional behavior.

Usage of TreeSet in the below code snippet is causing the issue.

      TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
      long curSize = 0;
      // stop at the end or the RAM threshold
      while (iter.hasNext() && curSize < threshold) {
        Put p = iter.next();
        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
          for (KeyValue kv : kvs) {
            curSize += kv.getLength();

Changing this back to a List and then sorting explicitly will solve the issue.
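
As a rough sketch of the List-plus-sort idea, under the same assumptions as before: this is plain Java, not the actual HBase reducer, and Cell and KEY_ONLY are hypothetical stand-ins for KeyValue and its comparator. Sorting a List keeps every element, including those whose keys compare equal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ListSortDemo {
    // Hypothetical stand-in for an HBase KeyValue.
    static final class Cell {
        final String row;
        final String column;
        final long timestamp;
        final String value;

        Cell(String row, String column, long timestamp, String value) {
            this.row = row;
            this.column = column;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Assumption: key-only ordering, simplified for illustration.
    static final Comparator<Cell> KEY_ONLY =
        Comparator.comparing((Cell c) -> c.row)
                  .thenComparing(c -> c.column)
                  .thenComparingLong(c -> c.timestamp);

    static List<Cell> sortedKeepingDuplicates(List<Cell> input) {
        List<Cell> out = new ArrayList<>(input);
        // List.sort is stable, so cells with equal keys keep their input order.
        out.sort(KEY_ONLY);
        return out;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
            new Cell("r2", "cf:q", 100L, "v0"),
            new Cell("r1", "cf:q", 100L, "v1"),
            new Cell("r1", "cf:q", 100L, "v2")); // same key as the previous cell
        for (Cell c : sortedKeepingDuplicates(cells)) {
            System.out.println(c.row + " " + c.timestamp + " " + c.value);
        }
    }
}
```

All three cells survive the sort here. Whether the downstream writer accepts two cells with identical keys is a separate question that this sketch does not address.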

Filed a new JIRA for this