hadoop-hive-dev mailing list archives

From "Edward Capriolo (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-417) Implement Indexing in Hive
Date Wed, 23 Dec 2009 23:15:29 GMT

    [ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794263#action_12794263 ]

Edward Capriolo commented on HIVE-417:

I am currently benchmarking an 11-node Hive cluster against a 16 TB MySQL system (4x quad core,
32 GB RAM, MySQL 5.1 with partitioning).

Hive destroys MySQL on any query like:

(date_id is my partition column.)

set mapred.map.tasks=34;
set mapred.reduce.tasks=11;
FROM pageviews
insert overwrite directory '/user/ecapriolo/hivetest4'
select sitename_id, user_id, count(user_id) WHERE date_id=20091250 group by sitename_id, user_id;
12098855 Rows loaded to /user/ecapriolo/hivetest4
Time taken: 185.528 seconds
The same query can take over 3000 seconds on MySQL, because these large summary queries are
always written to a temp table, and those writes then bottleneck your read queries.

However, if MySQL has an index (and if the index is in memory, which is hard in a warehouse)
on some other value in the where clause, like:

select sitename_id, user_id, count(user_id) FROM pageviews WHERE date_id=20091250 and sitename_id=400 group by sitename_id, user_id
MySQL gets a relative performance speed-up, while Hive ends up scanning the entire table.
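
To make the difference concrete, here is a toy sketch in Python. It is not how Hive or MySQL actually implement anything; the rows, the in-memory dict index, and the function names (full_scan, build_index, indexed_lookup) are all made up for illustration. It only counts how many rows each approach has to touch to answer the sitename_id=400 query within one date_id partition.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows: (date_id, sitename_id, user_id), all in one partition.
rows = [
    (20091250, 400, 1),
    (20091250, 400, 1),
    (20091250, 400, 2),
    (20091250, 401, 1),
    (20091250, 402, 3),
    (20091250, 402, 3),
]


def full_scan(rows, sitename_id):
    """Brute force: touch every row, filter, then group by (site, user)."""
    touched = 0
    counts = Counter()
    for _, site, user in rows:
        touched += 1
        if site == sitename_id:
            counts[(site, user)] += 1
    return dict(counts), touched


def build_index(rows):
    """Assumed index layout: sitename_id -> list of row positions."""
    index = defaultdict(list)
    for pos, (_, site, _) in enumerate(rows):
        index[site].append(pos)
    return index


def indexed_lookup(rows, index, sitename_id):
    """Touch only the rows the index points at, then group as before."""
    touched = 0
    counts = Counter()
    for pos in index.get(sitename_id, []):
        touched += 1
        _, site, user = rows[pos]
        counts[(site, user)] += 1
    return dict(counts), touched


index = build_index(rows)
scan_result, scan_touched = full_scan(rows, 400)
idx_result, idx_touched = indexed_lookup(rows, index, 400)
print(scan_result == idx_result, scan_touched, idx_touched)
```

Both paths return the same groups, but the scan touches every row in the partition while the lookup touches only the rows for the requested site. On a real warehouse the gap depends on how selective the indexed column is and on whether the index fits in memory.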

I agree with dhruba,
>> This sounds really awesome! Make hadoop-hive suitable for things other than brute force table-scans!

If we had indexes to help avoid some brute-force scans, that would open up other doors
to what Hive could do.

> Implement Indexing in Hive
> --------------------------
>                 Key: HIVE-417
>                 URL: https://issues.apache.org/jira/browse/HIVE-417
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>    Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
>            Reporter: Prasad Chakka
>            Assignee: He Yongqiang
>         Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch
> Implement indexing on Hive so that lookup and range queries are efficient.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
