hadoop-hive-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-417) Implement Indexing in Hive
Date Tue, 22 Sep 2009 04:18:16 GMT

    [ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758116#action_12758116 ]

Joydeep Sen Sarma commented on HIVE-417:

are there any references on this technique?

someone had earlier suggested this (apparently from reading Netezza documentation) - but i
don't understand when it would work. why would a (fairly large) sequencefile block contain only
a limited range of values (assuming the metadata stores a min-max range)? most cases i can
imagine in our dataset would either have low cardinality columns (so most values would be
present in every block) or, for high cardinality ones, the distribution would be random
(relative to the primary sort key) - and the range would seem ineffective.

unless there are columns that are closely related to how the data is sorted/partitioned (perhaps
some product ids are limited to a specific range of time - but the partitioning is on time and
not product id - and even that sounds dubious).
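
the argument above can be checked with a small simulation. this is only a sketch: the row count, block size, column contents and the helper names block_ranges and blocks_to_scan are made up for illustration and are not taken from hive or the attached patches. it builds per-block min-max metadata for one column that is unrelated to the sort order and one that tracks it, then counts how many blocks a point lookup would still have to read.

```python
import random


def block_ranges(values, block_size):
    """Return (min, max) metadata for each consecutive block of values."""
    ranges = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        ranges.append((min(block), max(block)))
    return ranges


def blocks_to_scan(ranges, key):
    """Count blocks whose [min, max] range cannot rule out the key."""
    return sum(1 for lo, hi in ranges if lo <= key <= hi)


random.seed(0)
n_rows, block_size = 100_000, 10_000  # assumed sizes: 10 blocks in total
key = 500_000

# high cardinality column with no relation to the sort order
random_col = [random.randrange(1_000_000) for _ in range(n_rows)]
# column that follows the sort order (e.g. ids that grow with time)
correlated_col = [i * 10 for i in range(n_rows)]

random_hits = blocks_to_scan(block_ranges(random_col, block_size), key)
correlated_hits = blocks_to_scan(block_ranges(correlated_col, block_size), key)

print("random column: blocks to scan =", random_hits)
print("correlated column: blocks to scan =", correlated_hits)
```

with uniformly random values, every block's range spans nearly the whole domain, so the min-max check cannot skip anything; only the column that follows the sort order gets pruned.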

a bloom filter would seem much more plausible at allowing good filtering. even then i don't
understand why this sort of metadata should be kept along with the block and not separately
(much more flexible - can be added on demand), as this jira is headed towards.
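
a minimal sketch of that alternative, under stated assumptions: the BloomFilter class, its sizes (1024 bits, 3 hashes), the block ids and the side_index dict are all hypothetical and only illustrate the idea of per-block filters held apart from the data. nothing here reflects how hive or the patches on this jira actually store an index.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# hypothetical data blocks, keyed by block id
blocks = {
    "block-0": [17, 42, 99],
    "block-1": [5, 1000, 31337],
}

# side index built separately from the blocks; it could be created on demand
side_index = {}
for block_id, values in blocks.items():
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    side_index[block_id] = bf

# only blocks whose filter admits the key need to be read
candidates = [b for b, bf in side_index.items() if bf.might_contain(42)]
print(candidates)
```

because the filters live in their own structure, they can be built or dropped per column without rewriting the data blocks. a filter never misses a value that is present, but it can occasionally admit a block that does not contain the key.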

> Implement Indexing in Hive
> --------------------------
>                 Key: HIVE-417
>                 URL: https://issues.apache.org/jira/browse/HIVE-417
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>    Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
>            Reporter: Prasad Chakka
>            Assignee: He Yongqiang
>         Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch
> Implement indexing on Hive so that lookup and range queries are efficient.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
