hadoop-hive-dev mailing list archives

From "He Yongqiang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-417) Implement Indexing in Hive
Date Sun, 17 May 2009 13:59:45 GMT

    [ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710198#action_12710198 ]

He Yongqiang commented on HIVE-417:

Thanks, Prasad, for the detailed description of the index design.
I have several questions:
Is the index sort-based? A single-column sort-based index can be very useful for queries
involving that column, both point queries and range queries. A sort-based multi-column index,
however, cannot be used for queries that do not reference the column used as the primary sort
order of the index. For example, say we create a sort-based index on table1(col1,col2,col3). The
index uses col1 as the primary sort order, col2 as the secondary sort order, and then col3.
We can use this index to accelerate queries like:
1) select * from table1 where col1>2 and col2<34  
2) select * from table1 where col1<34 and col3 >45
3) select * from table1 where col1>23
But we cannot use it for queries like:
4) select * from table1 where col2>34 and col3<3
5) select * from table1 where col2 =34
6) select * from table1 where col3 <45
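
A minimal Python sketch of why the leading column matters, assuming the index is nothing more
than the (col1, col2, col3) tuples kept in sorted order. The sample rows and function names are
hypothetical and only illustrate the idea; they are not part of any Hive design.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows of table1 as (col1, col2, col3) tuples.
rows = [(5, 40, 1), (1, 34, 50), (30, 10, 60), (24, 34, 2), (40, 20, 46)]

# Sort-based composite index: ordered by col1, then col2, then col3.
index = sorted(rows)

INF = float("inf")


def col1_greater_than(index, bound):
    """Query 3 shape (col1 > bound): binary search finds where the matches start."""
    start = bisect_right(index, (bound, INF, INF))
    return index[start:]


def col1_less_than(index, bound):
    """col1 < bound: matching entries form a contiguous run at the front."""
    end = bisect_left(index, (bound,))
    return index[:end]


def col1_lt_and_col3_gt(index, b1, b3):
    """Query 2 shape: narrow by col1 through the index, then filter col3 on that run."""
    return [r for r in col1_less_than(index, b1) if r[2] > b3]


def col2_equals(index, value):
    """Query 5 shape (col2 = value): matches are scattered, so every entry is checked."""
    return [r for r in index if r[1] == value]
```

In this sketch the col1 predicates touch only a contiguous slice of the index, while
col2_equals has to walk all entries because rows with the same col2 are not adjacent.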

Should we consider using indexes to accelerate queries that join several tables? For example,
say we have two tables, user and click (click having at least the userid, url, and datetime
columns used in the query below):
user(userid,name,address, age,title,company);
And now we have a query like:
select url  from user, click where user.userid=click.userid and user.name="user_name" and
datetime between last month;
This selects the list of URLs the specified user visited in the last month.

If we had an index such as:
create index user_url on table user(name), click(datetime) where user.userid=click.userid
then the above query could be accelerated.
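
One way to picture such a join index is as a precomputed join on userid whose entries are sorted
by (name, datetime). The Python sketch below is only an illustration under assumptions: the
click columns (userid, url, datetime), the sample data, the half-open date range, and the
function names are all hypothetical, since this message does not define the click table or the
index layout.

```python
from bisect import bisect_left
from datetime import date

# Hypothetical sample data.
users = [(1, "alice"), (2, "bob")]  # (userid, name)
clicks = [  # (userid, url, datetime)
    (1, "http://example.com/a", date(2009, 4, 3)),
    (2, "http://example.com/b", date(2009, 4, 10)),
    (1, "http://example.com/c", date(2009, 4, 20)),
    (1, "http://example.com/d", date(2009, 5, 2)),
    (1, "http://example.com/e", date(2009, 3, 30)),
]


def build_join_index(users, clicks):
    """Precompute the join on userid and sort entries by (name, datetime, url)."""
    name_by_id = {uid: name for uid, name in users}
    entries = [
        (name_by_id[uid], when, url)
        for uid, url, when in clicks
        if uid in name_by_id
    ]
    entries.sort()
    return entries


def urls_for(index, name, start, end):
    """URLs clicked by `name` with start <= datetime < end, via two binary searches."""
    lo = bisect_left(index, (name, start))
    hi = bisect_left(index, (name, end))
    return [url for _, _, url in index[lo:hi]]


join_index = build_join_index(users, clicks)
april_urls = urls_for(join_index, "alice", date(2009, 4, 1), date(2009, 5, 1))
```

The query then reads one contiguous slice of the index instead of joining the two tables at
query time; the cost is that the index has to be maintained whenever either table changes.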

Indexes can also be used in group-by aggregation queries. Should we consider those as well?
Another feature is integrating a Lucene index with Hive. Ashish suggested integrating Katta.
I took a look at Katta, and I think it may not be necessary to include it. If we do include
it, Hive users will have to deploy Katta and ZooKeeper in their clusters. I think we can
integrate Lucene internally without touching Katta.

> Implement Indexing in Hive
> --------------------------
>                 Key: HIVE-417
>                 URL: https://issues.apache.org/jira/browse/HIVE-417
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>    Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
>            Reporter: Prasad Chakka
>            Assignee: He Yongqiang
> Implement indexing on Hive so that lookup and range queries are efficient.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
