Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Wed, 4 Feb 2015 05:27:35 +0000 (UTC)
From: "Gopal V (JIRA)" <jira@apache.org>
To: hive-dev@hadoop.apache.org
Message-ID: <JIRA.12763154.1419218359000.250626.1423027655126@Atlassian.JIRA>
In-Reply-To: <JIRA.12763154.1419218359000@Atlassian.JIRA>
References: <JIRA.12763154.1419218359000@Atlassian.JIRA>
 <JIRA.12763154.1419218359665@arcas>
Subject: [jira] [Commented] (HIVE-9188) BloomFilter support in ORC
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304661#comment-14304661 ] 

Gopal V commented on HIVE-9188:
-------------------------------

The predicate evaluation should always use min/max comparisons.

Min/max pruning is turned off for a column that has a bloom filter. This inadvertently turns off the fastest check in favour of a slower one.

{code}
+        // if bloom filter exists, check in bloom filter else min/max stats
+        if (bloomFilter == null) {
+              loc = compareToRange((Comparable) predObj, minValue, maxValue);
+              if (loc == Location.MIN) {
+                return hasNull ? TruthValue.YES_NULL : TruthValue.YES;
+              }
{code}
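
A minimal sketch of the ordering suggested above, for an equality predicate on a long column: check the min/max range first, and probe the bloom filter only when the value falls inside that range. This is not the ORC code from the patch. The names MinMaxFirstSketch, TinyBloom and mightMatchEquals are hypothetical, and the two-hash bloom filter is a toy stand-in used only to make the example self-contained.

```java
import java.util.BitSet;

final class MinMaxFirstSketch {

    /** Hypothetical toy bloom filter over long values; not ORC's implementation. */
    static final class TinyBloom {
        private final BitSet bits;
        private final int size;

        TinyBloom(int size) {
            this.size = size;
            this.bits = new BitSet(size);
        }

        void add(long value) {
            bits.set(index(value, 1));
            bits.set(index(value, 2));
        }

        /** May return a false positive, never a false negative. */
        boolean mightContain(long value) {
            return bits.get(index(value, 1)) && bits.get(index(value, 2));
        }

        // Simple multiplicative mixing; an assumption for illustration only.
        private int index(long value, int seed) {
            long h = value * 0x9E3779B97F4A7C15L + seed * 0xC2B2AE3D27D4EB4FL;
            h ^= (h >>> 32);
            return (int) Math.floorMod(h, (long) size);
        }
    }

    /**
     * Returns false only when the row group provably cannot contain pred.
     * The cheap min/max comparison always runs first; the bloom filter
     * (which may be null) is consulted only for values inside [min, max].
     */
    static boolean mightMatchEquals(long pred, long min, long max, TinyBloom bloom) {
        if (pred < min || pred > max) {
            return false; // pruned by min/max, bloom filter never touched
        }
        if (bloom != null && !bloom.mightContain(pred)) {
            return false; // inside the range, but the bloom filter rules it out
        }
        return true; // must read the row group
    }

    public static void main(String[] args) {
        TinyBloom bloom = new TinyBloom(1024);
        bloom.add(100L);
        bloom.add(200L);
        bloom.add(300L);
        System.out.println(mightMatchEquals(50L, 100L, 300L, bloom));
        System.out.println(mightMatchEquals(200L, 100L, 300L, bloom));
    }
}
```

With this ordering a bloom filter can only remove additional row groups. It never replaces the range check, so a column with a bloom filter prunes at least as well as one without.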

I ran L_ORDERKEY filters once with bloom filters and once with only min/max pruning. The rows-read counts were surprising at the 1 TB scale.

{code}
With bloom filters:
VERTICES         TOTAL_TASKS   DURATION_SECONDS    CPU_TIME_MILLIS     GC_TIME_MILLIS  INPUT_RECORDS    
Map 1                    198                7.88          1,162,490             16,270      2,960,000   8

Without bloom filters:
VERTICES         TOTAL_TASKS   DURATION_SECONDS    CPU_TIME_MILLIS     GC_TIME_MILLIS  INPUT_RECORDS    
Map 1                    194                6.28          1,422,550             33,483        410,000   4
{code}

Without PPD (predicate pushdown), the query actually reads 5,999,989,709 records in ~10s.

> BloomFilter support in ORC
> --------------------------
>
>                 Key: HIVE-9188
>                 URL: https://issues.apache.org/jira/browse/HIVE-9188
>             Project: Hive
>          Issue Type: New Feature
>          Components: File Formats
>    Affects Versions: 0.15.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>              Labels: orcfile
>         Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, HIVE-9188.4.patch, HIVE-9188.5.patch, HIVE-9188.6.patch, HIVE-9188.7.patch, HIVE-9188.8.patch, HIVE-9188.9.patch
>
>
> Bloom filters are a well-known probabilistic data structure for set membership checking. We can use bloom filters in the ORC index for better row group pruning. Currently, the ORC row group index uses min/max statistics to eliminate row groups (and stripes as well) that do not satisfy the predicate condition specified in the query. But in some cases, min/max based elimination is not effective (unsorted columns with a wide range of entries). Bloom filters can be an effective and efficient alternative for row group/split elimination for point queries or queries with an IN clause.

