hive-issues mailing list archives

From "Pengcheng Xiong (JIRA)" <>
Subject [jira] [Updated] (HIVE-12763) Use bit vector to track NDV
Date Wed, 27 Jan 2016 03:20:39 GMT


Pengcheng Xiong updated HIVE-12763:
    Attachment: aggrStatsPerformance.png

As per [~jpullokkaran]'s request, I tested the time and space complexity of aggrStats on my
Mac. In the attached plot, the x-axis is the number of partitions and the y-axis is the time
taken, in ms, to aggregate the stats of those partitions. The aggrStats time increases with
the number of partitions, but it is still quite fast: 475 ms for 1000 partitions. I could not
go beyond 1000 because my Mac dies when I increase it to 2000. The time complexity is pretty
good, mainly because the operation is simple (bitwise OR).

The space complexity is also good. With 16 bit vectors, each an array of at most 31 integers,
one partition needs at most 16*31*4B (about 2KB), and the total is that multiplied by the
number of partitions. In an extreme case of 1 million partitions, the total space is
16*31*4B*1M (around 2GB). This is the space needed to store every bit vector in HBaseStore
(without considering serialization). When we aggregate the partition stats one by one, we
only need to hold two sets of bit vectors at a time, so the memory needed is 16*31*4B*2
(around 4KB).
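
A minimal Python sketch of the merge and the space arithmetic described in this comment. It is
not Hive's implementation: the function names are hypothetical, and each bit vector is modeled
as a plain Python int used as a bitmask, which is an assumption made to keep the example short.
The constants (16 bit vectors, at most 31 integers each, 4 bytes per integer) are the figures
quoted in the comment.

```python
# Figures taken from the comment; everything else here is illustrative.
NUM_BIT_VECTORS = 16       # bit vectors per partition
MAX_INTS_PER_VECTOR = 31   # "at most 31 integers" per bit vector
BYTES_PER_INT = 4


def merge(a, b):
    """OR two partitions' bit vectors together, vector by vector."""
    if len(a) != len(b):
        raise ValueError("both inputs must have the same number of bit vectors")
    return [x | y for x, y in zip(a, b)]


def aggregate(partitions):
    """Fold partitions in one at a time.

    Only the accumulator and the current partition are held at once,
    which is where the factor of 2 in the memory estimate comes from.
    """
    acc = [0] * NUM_BIT_VECTORS
    for bit_vectors in partitions:
        acc = merge(acc, bit_vectors)
    return acc


def bytes_per_partition():
    """Upper bound on the bytes needed for one partition's bit vectors."""
    return NUM_BIT_VECTORS * MAX_INTS_PER_VECTOR * BYTES_PER_INT


def bytes_to_store(num_partitions):
    """Bytes needed to keep every partition's bit vectors (no serialization)."""
    return bytes_per_partition() * num_partitions


def bytes_to_aggregate():
    """Bytes needed while aggregating one partition at a time."""
    return bytes_per_partition() * 2


if __name__ == "__main__":
    p1 = [0b0101] * NUM_BIT_VECTORS
    p2 = [0b0011] * NUM_BIT_VECTORS
    print(aggregate([p1, p2])[0])      # bits set in either partition
    print(bytes_to_store(1_000_000))   # bytes for 1 million partitions
    print(bytes_to_aggregate())        # bytes while aggregating
```

Because OR is associative and commutative, the partitions can be folded in any order, and the
work per partition is a fixed 16 OR operations regardless of how many partitions came before.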

> Use bit vector to track NDV
> ---------------------------
>                 Key: HIVE-12763
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Pengcheng Xiong
>            Assignee: Pengcheng Xiong
>         Attachments: HIVE-12763.01.patch, HIVE-12763.02.patch, HIVE-12763.03.patch, HIVE-12763.04.patch, HIVE-12763.05.patch, aggrStatsPerformance.png
> This will improve merging of per-partition stats. It will also help merge NDV for auto-gather column stats.

This message was sent by Atlassian JIRA