hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Prasad Chakka (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-417) Implement Indexing in Hive
Date Fri, 15 May 2009 16:37:45 GMT

    [ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709899#action_12709899

Prasad Chakka commented on HIVE-417:

Here is a very rough outline as how this can be done (prototype code has creation and execution
parts but not he HiveQL related stuff)

hive indexing:
goal of hive indexing is to speed up lookup queries on certain columns of the table. currently
queries with predicates like 'WHERE tab1.col1 = 10' has to load the complete table/partition
and process all the rows. if there exists an index on col1 then only a small portion of the
file can be loaded.

command to create index:

create index tab_idx1 on tab1 (col1, ...);

if the base table is partitioned then the index is also partitioned. indexes can be created
on base tables whose file format supports (getPos() and possible seek() or equivalent methods.)

format of index:
index is also a hive table with the following columns
col_1...col_k -- key cols. base table columns on which this index is defined
list<offset>  -- positions of rows which contain these keys

offset is a combination of following
file_name   -- relative path of the file in which this row is contained. (relative to the
partition/table location)
byte_offset -- byte offset of the row in the file. row can be found at this byte offset or
in the block starting at this byte offset for Block Compressed Sequence Files.

when to create index:
traditionally databases try to update index when the table is loaded. hive doesn't process
rows while loading tables using 'LOAD DATA INPATH' command. also it may slow down the actual
loading for 'INSERT ... SELECT ... FROM ...' type of statements. so users should have an option
whether the index is initialized during 'INSERT ... SELECT ...' or initialized separately.
Another command like 'update index tab_idx partition ..' can be provided.

how to create index:
index can be created using the following hive command augmented with 'offset'
'select col_1...col_k, offset from tab1'

offset can be provided as built in function which can be derived in HiveInputRecordReader
which will in turn use the specific FileFormat's Reader getPos() method and the 'map.input.file'
for the file name (or from the tableDesc or partiionDesc).

Algorithm For using index:
1) Hive QL needs to determine whether a particular query can use any existing indexes. This
can be determined by examining the predicate tree. After predicate pushdown, all those predicates
which can use index are in the child operator of a TableScanOperator. This predicate tree
needs to be examined. If this contains any subset of columns of an index then that index can
be used. Until stats are available, it is not possible to guess whether using index is beneficial.
This needs to be fleshed out more to check both 'AND' and 'OR' predicates.

2) For each of the qualified indexes, a map/reduce job can be created using the predicates
determined in step 1. The output of this job should have the following information
file_name   -- fully qualified file name that contains the data
byte_offset -- position of row

3) If there is more than one qualified index then the outputs of step2 needs to be combined
depending on whether the predicates on these indexes have 'AND' or 'OR' between them.

4) Modify the original plan to use only those FileSplits that appear in the output of step3.
This reduces the number of mappers spawned by JobTracker.

5) Modify the original plan to use HiveIndexRecordReader instead of regular record reader.
Output of step3 (which is sorted) is available to the HiveIndexRecordReader. It can skip to
these locations instead of reading every record in the input of the Mapper.

> Implement Indexing in Hive
> --------------------------
>                 Key: HIVE-417
>                 URL: https://issues.apache.org/jira/browse/HIVE-417
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>    Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
>            Reporter: Prasad Chakka
> Implement indexing on Hive so that lookup and range queries are efficient.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message