hive-dev mailing list archives

From "Gopal V (JIRA)" <>
Subject [jira] [Commented] (HIVE-9188) BloomFilter support in ORC
Date Wed, 04 Feb 2015 05:27:35 GMT


Gopal V commented on HIVE-9188:

The predicate evaluation should always use min/max comparisons.

The min-max pruning is turned off for a column which has a bloom filter. This inadvertently
turns off the fastest check in favour of a slower check.

+        // if bloom filter exists, check in bloom filter else min/max stats
+        if (bloomFilter == null) {
+              loc = compareToRange((Comparable) predObj, minValue, maxValue);
+              if (loc == Location.MIN) {
+                return hasNull ? TruthValue.YES_NULL : TruthValue.YES;
+              }
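
The ordering suggested above (range check first, bloom filter probe only for values that fall inside the range) can be sketched as below. This is a minimal illustration and not the patch's code: the class MinMaxThenBloom, the toy SimpleBloom filter, and the canSkip helper are all hypothetical names, and the hashing is a simple stand-in rather than whatever ORC uses.

```java
import java.util.BitSet;

// Hypothetical sketch: evaluate an equality predicate against one row group.
final class MinMaxThenBloom {

    /** Toy Bloom filter over long keys (assumption: NOT ORC's implementation). */
    static final class SimpleBloom {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        SimpleBloom(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        // Double hashing: the i-th probe position is h1 + i * h2 (mod numBits).
        private int index(long key, int i) {
            long h = key * 0x9E3779B97F4A7C15L;
            int h1 = (int) (h ^ (h >>> 32));
            int h2 = ((int) (h >>> 17)) | 1;
            return Math.floorMod(h1 + i * h2, numBits);
        }

        void add(long key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(key, i));
            }
        }

        // May return a false positive, never a false negative.
        boolean mightContain(long key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(key, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Returns true if the row group can be skipped for the predicate col = key. */
    static boolean canSkip(long key, long min, long max, SimpleBloom bloom) {
        // Cheapest check first: two comparisons against the min/max statistics.
        if (key < min || key > max) {
            return true;
        }
        // Only keys inside [min, max] pay for the bloom filter probe.
        return bloom != null && !bloom.mightContain(key);
    }
}
```

With this shape, a column that has a bloom filter still benefits from min/max pruning, and a column without one (bloom == null) behaves exactly as it did before.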

I ran L_ORDERKEY filters both with bloom filters and with min-max pruning alone. The rows-read
counts were surprising at the 1 TB scale.

With bloom filters:
Map 1                    198                7.88          1,162,490             16,270          2,960,000   8

Without bloom filters:
Map 1                    194                6.28          1,422,550             33,483            410,000   4

Without PPD, that actually reads 5,999,989,709 records in ~10s.

> BloomFilter support in ORC
> --------------------------
>                 Key: HIVE-9188
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>          Components: File Formats
>    Affects Versions: 0.15.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>              Labels: orcfile
>         Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, HIVE-9188.4.patch,
> HIVE-9188.5.patch, HIVE-9188.6.patch, HIVE-9188.7.patch, HIVE-9188.8.patch, HIVE-9188.9.patch
> Bloom filters are a well-known probabilistic data structure for set membership checking.
> We can use bloom filters in the ORC index for better row group pruning. Currently, the ORC row group
> index uses min/max statistics to eliminate row groups (and stripes as well) that do not satisfy
> the predicate condition specified in the query. But in some cases, the efficiency of min/max-based
> elimination is not optimal (unsorted columns with a wide range of entries). Bloom filters can
> be an effective and efficient alternative for row group/split elimination for point queries
> or queries with an IN clause.

This message was sent by Atlassian JIRA
