Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Reply-To: dev@hive.apache.org
Date: Wed, 4 Feb 2015 05:27:35 +0000 (UTC)
From: "Gopal V (JIRA)"
To: hive-dev@hadoop.apache.org
Subject: [jira] [Commented] (HIVE-9188) BloomFilter support in ORC

    [ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304661#comment-14304661 ]

Gopal V commented on HIVE-9188:
-------------------------------

The predicate evaluation should always use min/max comparisons. The min-max pruning is turned off for a column that has a bloom filter. This inadvertently turns off the fastest check in favour of a slower check.

{code}
+      // if bloom filter exists, check in bloom filter else min/max stats
+      if (bloomFilter == null) {
+        loc = compareToRange((Comparable) predObj, minValue, maxValue);
+        if (loc == Location.MIN) {
+          return hasNull ? TruthValue.YES_NULL : TruthValue.YES;
+        }
{code}

I ran L_ORDERKEY filters with bloom filters and with min-max pruning. The rows-read numbers were surprising at the 1 TB scale.

{code}
With bloom filters:

VERTICES  TOTAL_TASKS  DURATION_SECONDS  CPU_TIME_MILLIS  GC_TIME_MILLIS  INPUT_RECORDS
Map 1             198              7.88        1,162,490          16,270      2,960,000  8

Without bloom filters:

Map 1             194              6.28        1,422,550          33,483        410,000  4
{code}

Without PPD, that actually reads 5,999,989,709 records in ~10s.

> BloomFilter support in ORC
> --------------------------
>
>                 Key: HIVE-9188
>                 URL: https://issues.apache.org/jira/browse/HIVE-9188
>             Project: Hive
>          Issue Type: New Feature
>          Components: File Formats
>    Affects Versions: 0.15.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>              Labels: orcfile
>         Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, HIVE-9188.4.patch, HIVE-9188.5.patch, HIVE-9188.6.patch, HIVE-9188.7.patch, HIVE-9188.8.patch, HIVE-9188.9.patch
>
>
> Bloom filters are a well-known probabilistic data structure for set membership checking. We can use bloom filters in the ORC index for better row group pruning. Currently, the ORC row group index uses min/max statistics to eliminate row groups (and stripes as well) that do not satisfy the predicate condition specified in the query. But in some cases, the efficiency of min/max based elimination is not optimal (unsorted columns with a wide range of entries). Bloom filters can be an effective and efficient alternative for row group/split elimination for point queries or queries with an IN clause.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
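
The check order argued for in the comment above (min/max statistics first, bloom filter only for values that fall inside the range) can be sketched roughly as follows. This is an illustrative sketch only, not the ORC reader code: the class RowGroupPruningSketch, the ToyBloomFilter type and the canSkip method are hypothetical names, the hash mixing is an arbitrary choice, and the predicate is assumed to be a simple equality on a long column with null handling left out.

```java
import java.util.BitSet;

// Hypothetical sketch; none of these names come from the Hive/ORC code base.
class RowGroupPruningSketch {

    /** Toy bloom filter over long keys (assumed design, not ORC's BloomFilter). */
    static final class ToyBloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        ToyBloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        // 64-bit mixing function (splitmix64-style), chosen only for illustration.
        private static long mix(long z) {
            z += 0x9e3779b97f4a7c15L;
            z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
            z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
            return z ^ (z >>> 31);
        }

        // Derive the i-th probe position from two base hashes (double hashing).
        private int index(long key, int i) {
            long h1 = mix(key);
            long h2 = mix(h1);
            return (int) Math.floorMod(h1 + i * h2, (long) numBits);
        }

        void add(long key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(key, i));
            }
        }

        /** False means "definitely absent"; true means "possibly present". */
        boolean mightContain(long key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(key, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    /**
     * Returns true if a row group can be skipped for the predicate "col = key".
     * The bloom filter may be null when the column has none.
     */
    static boolean canSkip(long key, long min, long max, ToyBloomFilter bloom) {
        // Cheapest check first: two comparisons against the min/max statistics.
        if (key < min || key > max) {
            return true;
        }
        // Only values inside [min, max] pay for the bloom filter probe.
        if (bloom != null && !bloom.mightContain(key)) {
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ToyBloomFilter bloom = new ToyBloomFilter(1024, 3);
        long[] keys = {10L, 500L, 990L};
        for (long k : keys) {
            bloom.add(k);
        }
        System.out.println("key 5 skipped:   " + canSkip(5L, 10L, 990L, bloom));
        System.out.println("key 500 skipped: " + canSkip(500L, 10L, 990L, bloom));
        System.out.println("key 501 skipped: " + canSkip(501L, 10L, 990L, bloom));
    }
}
```

With this ordering, a column that has a bloom filter still benefits from range pruning, and the bloom filter only refines the cases that min/max cannot rule out.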