hadoop-hive-dev mailing list archives

From "He Yongqiang (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HIVE-417) Implement Indexing in Hive
Date Tue, 08 Jun 2010 21:49:15 GMT

     [ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-417:
------------------------------

    Attachment: hive-indexing.3.patch

With this patch, the index works, but it is not yet very intelligent.

This is how this patch works:

=== how to create the index table and generate index data ===
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

drop table src_rc_index;

//create an index table on table src_rc; the index column is key.
//The index table's data is stored as a textfile (seq and rcfile also work)
create index src_rc_index type compact on table src_rc(key) stored as textfile; 

hive> show table extended like src_rc_index;
tableName:src_rc_index
owner:heyongqiang
location:file:/user/hive/warehouse/src_rc_index
inputformat:org.apache.hadoop.mapred.TextInputFormat
outputformat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
columns:struct columns { i32 key, string _bucketname, list<string> _offsets}

About the index table's schema: besides the index columns from the base table, the index table
has two more columns (_bucketname string, _offsets array<string>).


//generate the actual index table's data (partitions are also supported)
update index src_rc_index;
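
To make the shape of the generated index data concrete, here is a minimal Python sketch of the grouping that "update index" has to perform. It is not the patch's code: the input tuples (key, bucketname, offset), the file paths, and the function name build_compact_index are hypothetical, and it assumes each offset marks where a matching row (or its block) starts in the base file.

```python
from collections import defaultdict


def build_compact_index(rows):
    """Group base-table positions into index rows.

    rows: iterable of (key, bucketname, offset) tuples (hypothetical input shape).
    Returns a list of (key, bucketname, offsets) sorted by key, with offsets
    kept as strings to mirror the list<string> column of the index table.
    """
    grouped = defaultdict(set)
    for key, bucketname, offset in rows:
        grouped[(key, bucketname)].add(offset)
    return [
        (key, bucketname, [str(o) for o in sorted(offsets)])
        for (key, bucketname), offsets in sorted(grouped.items())
    ]


# Hypothetical base-table positions; the paths are illustrative only.
base_rows = [
    (0, "/warehouse/src_rc/000000_0", 2048),
    (5, "/warehouse/src_rc/000000_0", 4096),
    (0, "/warehouse/src_rc/000000_0", 0),
    (0, "/warehouse/src_rc/000001_0", 512),
]
index_rows = build_compact_index(base_rows)
```

Each resulting row corresponds to one row of the index table: the indexed column, the file it was found in, and the offsets within that file.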

=== how to use the index table ===

//find the offsets for 'key=0' in the index table, and put the bucketname and offset list in
//a temp directory
insert overwrite directory "/tmp/index_result" select `_bucketname` ,  `_offsets` from src_rc_index
where key=0;

set hive.exec.index_file=/tmp/index_result; 

//use a new index input format to prune input splits based on the offset list
//stored in "hive.exec.index_file", which is populated by the previous command
set hive.input.format=org.apache.hadoop.hive.ql.index.io.HiveIndexInputFormat;

//this query will not scan the whole base data
select key, value from src_rc where key=0;
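
The pruning step can be pictured with a small Python sketch. This is only an illustration of the idea, not how HiveIndexInputFormat is implemented: the split representation (path, start, length), the dictionary of index hits, and the name prune_splits are all assumptions made for the example.

```python
def prune_splits(splits, index_hits):
    """Keep only the splits that contain at least one indexed offset.

    splits: list of (path, start, length) tuples (hypothetical representation).
    index_hits: dict mapping a file path to its list of offsets, stored as
    strings like the _offsets column of the index table.
    """
    kept = []
    for path, start, length in splits:
        offsets = [int(o) for o in index_hits.get(path, [])]
        if any(start <= off < start + length for off in offsets):
            kept.append((path, start, length))
    return kept


# Hypothetical splits over two base files, 100 bytes each.
splits = [
    ("/warehouse/src_rc/000000_0", 0, 100),
    ("/warehouse/src_rc/000000_0", 100, 100),
    ("/warehouse/src_rc/000001_0", 0, 100),
]
# Hypothetical result of the index lookup for key=0.
index_hits = {"/warehouse/src_rc/000000_0": ["120"]}

remaining = prune_splits(splits, index_hits)
```

Here only the second split of the first file survives, so a single map task would be launched instead of three.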


Things done in the patch:
1) HQL command for creating an index table
2) HQL command and map-reduce job for updating the index (generating the index table's data)
3) a HiveIndexInputFormat that uses the offsets obtained from the index table to reduce the number of
blocks/map-tasks

Things that still need to be done:
1) Right now the index table is manually specified in queries. We need this to be more intelligent
by automatically generating a plan that uses the index.
2) The HiveIndexInputFormat needs a new RecordReader that seeks to a given offset instead of
scanning the whole block.
3) Right now we use a map-reduce job to scan the whole index table to find the offsets of hits. But
since the index table is sorted, we can leverage the sort order to avoid the map-reduce
job in many cases. (The easiest way is to do a binary search in the client.)
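
As a rough illustration of the third item, a client-side lookup over sorted index rows might look like the Python sketch below. It assumes the index rows are already available to the client as a list sorted by key, which the patch does not provide today; the names lookup, sorted_keys, and index_rows are hypothetical.

```python
import bisect


def lookup(sorted_keys, index_rows, key):
    """Return all index rows whose key equals `key`, using binary search.

    sorted_keys: the key column of index_rows, in the same (sorted) order.
    index_rows: list of (key, bucketname, offsets) tuples sorted by key.
    """
    lo = bisect.bisect_left(sorted_keys, key)
    hi = bisect.bisect_right(sorted_keys, key)
    return index_rows[lo:hi]


# Hypothetical index rows, sorted by key.
index_rows = [
    (0, "/warehouse/src_rc/000000_0", ["0", "2048"]),
    (0, "/warehouse/src_rc/000001_0", ["512"]),
    (5, "/warehouse/src_rc/000000_0", ["4096"]),
]
sorted_keys = [row[0] for row in index_rows]

hits_for_zero = lookup(sorted_keys, index_rows, 0)
hits_for_three = lookup(sorted_keys, index_rows, 3)
```

The two bisect calls find the start and end of the run of equal keys, so the same approach would extend to range queries by searching for the lower and upper bound separately.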

The first todo is the most important part. I think the third may need much more work (though
that may not be true).

(Note: although this patch has been tested on a production cluster, it could still have bugs.
We would really appreciate it if you report any bugs you find.)

> Implement Indexing in Hive
> --------------------------
>
>                 Key: HIVE-417
>                 URL: https://issues.apache.org/jira/browse/HIVE-417
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>    Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
>            Reporter: Prasad Chakka
>            Assignee: He Yongqiang
>         Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch, hive-indexing.3.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

