hive-dev mailing list archives

From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
Date Wed, 07 Jan 2015 18:41:34 GMT

    [ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268002#comment-14268002 ]

Gopal V commented on HIVE-9188:
-------------------------------

[~owen.omalley]: the problem with the stream approach is that the stream is read only after
the disk ranges have been computed (and read), so we don't get the IO savings.

The row-group stats are the only data read ahead of the actual HDFS IO ops, which is what
lets us skip reads off the disk.
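
To illustrate why the ordering matters, here is a minimal sketch (not ORC's actual reader code) of
planning disk ranges from row-group stats that are already in memory. The RowGroup layout, its
field names, and plan_disk_ranges are hypothetical, chosen only for this example. Anything that
becomes available only after this planning step cannot shrink the list of ranges to read.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    # Hypothetical layout: byte range of the row group's data plus its min/max stats.
    offset: int
    length: int
    min_val: int
    max_val: int


def plan_disk_ranges(row_groups, value):
    """Return merged (offset, length) ranges for row groups that may contain value."""
    ranges = []
    for rg in row_groups:
        if not (rg.min_val <= value <= rg.max_val):
            continue  # pruned using stats alone, so no IO is issued for this row group
        if ranges and ranges[-1][0] + ranges[-1][1] == rg.offset:
            # Adjacent to the previous selected range: extend it instead of adding a new one.
            prev_offset, prev_length = ranges[-1]
            ranges[-1] = (prev_offset, prev_length + rg.length)
        else:
            ranges.append((rg.offset, rg.length))
    return ranges


groups = [
    RowGroup(offset=0, length=100, min_val=1, max_val=10),
    RowGroup(offset=100, length=100, min_val=20, max_val=30),
    RowGroup(offset=200, length=100, min_val=5, max_val=25),
]
print(plan_disk_ranges(groups, 7))  # [(0, 100), (200, 100)]
```

In this sketch the middle row group is never read for the value 7, because its stats were consulted
before any range was scheduled.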

> BloomFilter in ORC row group index
> ----------------------------------
>
>                 Key: HIVE-9188
>                 URL: https://issues.apache.org/jira/browse/HIVE-9188
>             Project: Hive
>          Issue Type: New Feature
>          Components: File Formats
>    Affects Versions: 0.15.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>              Labels: orcfile
>         Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, HIVE-9188.4.patch
>
>
> Bloom filters are a well-known probabilistic data structure for set membership checking.
> We can use Bloom filters in the ORC index for better row group pruning. Currently, the ORC row
> group index uses min/max statistics to eliminate row groups (and stripes) that do not satisfy
> the predicate condition specified in the query. But in some cases min/max based elimination
> is not effective (for example, unsorted columns with a wide range of entries). Bloom filters
> can be an effective and efficient alternative for row group/split elimination for point
> queries or queries with an IN clause.
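
The description above can be illustrated with a minimal sketch of a Bloom filter next to a min/max
check for a single unsorted row group. This is not the implementation in the attached patches: the
BloomFilter class, its parameters (num_bits, num_hashes), and the SHA-256 based hashing are
assumptions made only for this example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions by salting the value with a seed (assumed scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# One unsorted row group with a wide range of entries.
row_group_values = [3, 900, 42, 17]
lo, hi = min(row_group_values), max(row_group_values)

bloom = BloomFilter()
for v in row_group_values:
    bloom.add(v)


def min_max_may_match(value):
    return lo <= value <= hi


# Point query for a value that is absent but falls inside [min, max]:
# min/max cannot eliminate the row group, while the Bloom filter usually can.
print(min_max_may_match(500), bloom.might_contain(500))
```

For an IN clause, the same check would be applied to each listed value, and the row group skipped
only if might_contain returns False for all of them.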



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
