Date: Wed, 7 Jan 2015 21:06:35 +0000 (UTC)
From: "Prasanth Jayachandran (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

    [ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268215#comment-14268215 ]

Prasanth Jayachandran commented on HIVE-9188:
---------------------------------------------

The 0.05 fpp (false positive probability) is guaranteed only at the row index stride level, which is 10k rows by default. Merging the bloom filters to higher levels (stripe, file) will increase the fpp, because the filter size stays constant while the number of inserted values grows. We will get a worse fpp if the stripe-level filter exceeds its expected number of insertions.
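
To put rough numbers on the merging argument, here is a minimal sketch using the standard bloom filter approximations. It is not taken from the HIVE-9188 patch. The function names are hypothetical, and it assumes the worst case where every row in every merged row group carries a distinct value.

```python
import math

# Values taken from the comment above: 10k rows per stride, 0.05 target fpp.
ROWS_PER_STRIDE = 10_000
TARGET_FPP = 0.05


def optimal_num_bits(n, fpp):
    """Bits needed to hold n insertions at the target fpp (textbook formula)."""
    return math.ceil(-n * math.log(fpp) / (math.log(2) ** 2))


def optimal_num_hashes(n, m):
    """Number of hash functions that minimizes fpp for n insertions in m bits."""
    return max(1, round(m / n * math.log(2)))


def expected_fpp(n, m, k):
    """Approximate fpp after n distinct insertions into m bits with k hashes."""
    return (1.0 - math.exp(-k * n / m)) ** k


# Size the filter once, for a single row index stride.
NUM_BITS = optimal_num_bits(ROWS_PER_STRIDE, TARGET_FPP)
NUM_HASHES = optimal_num_hashes(ROWS_PER_STRIDE, NUM_BITS)


def merged_fpp(num_row_groups):
    """Estimated fpp after OR-ing the filters of num_row_groups row groups.

    Assumption: all values are distinct, so insertions add up while the
    bit count stays fixed at the per-stride size.
    """
    return expected_fpp(num_row_groups * ROWS_PER_STRIDE, NUM_BITS, NUM_HASHES)


if __name__ == "__main__":
    for groups in (1, 2, 5, 10):
        print(groups, "row group(s): estimated fpp", round(merged_fpp(groups), 4))
```

Under these assumptions the estimate is roughly 0.05 for a single row group and above 0.9 once ten row groups are merged into a filter of the same size. Columns with many repeated values would degrade more slowly.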

We don't really need the file-level bloom filter, as it's not useful considering we already have stripe-level statistics. If we keep the bloom filter in the row index, we can read it in a single IO per stripe, but we will end up reading the bloom filters of columns that do not participate in bloom filter evaluation. On the other hand, if we store the bloom filter as a separate stream, we will need an extra IO op per stripe to read it. A separate stream also has additional costs (a boolean flag in the row index to indicate whether we have a bloom filter for that column, plus position information).

> BloomFilter in ORC row group index
> ----------------------------------
>
>                 Key: HIVE-9188
>                 URL: https://issues.apache.org/jira/browse/HIVE-9188
>             Project: Hive
>          Issue Type: New Feature
>          Components: File Formats
>    Affects Versions: 0.15.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>              Labels: orcfile
>         Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, HIVE-9188.4.patch
>
>
> Bloom filters are a well-known probabilistic data structure for set membership checking. We can use bloom filters in the ORC index for better row group pruning. Currently, the ORC row group index uses min/max statistics to eliminate row groups (and stripes as well) that do not satisfy the predicate condition specified in the query. But in some cases, min/max-based elimination is not effective (for example, unsorted columns with a wide range of entries). Bloom filters can be an effective and efficient alternative for row group/split elimination for point queries or queries with an IN clause.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)