hive-dev mailing list archives

From "Nikhil Deshpande (JIRA)" <>
Subject [jira] Updated: (HIVE-1694) Accelerate query execution using indexes
Date Thu, 28 Oct 2010 11:24:22 GMT


Nikhil Deshpande updated HIVE-1694:

    Status: Patch Available  (was: Open)

This patch demonstrates the query performance gains from using indexes
(added in HIVE-417). It applies against the latest Hive trunk.

ChangeLog for the patch:
- Implements a new rewrite for a certain set of GROUP BY queries, which speeds them up
by running them on index data instead of the base table.
- Implements a skeleton generic rewrite engine.
- Implements the rewrite rule for the set of GROUP BY queries mentioned above. More details
are in the class comment of GbToCompactSumIdxRewrite.
- The rewrite currently has to be enabled explicitly with a flag.
- Modifies the metastore & metadata API to expose some index info.
- Modifies the QB metadata & parse block code to add some rewrite assist methods.
- Inserts a rewrite hook into the Semantic Analyzer.
- Fixes a bug in ql QTestUtil so that indexed tables are cleaned up properly.
- Contains a new test for the Group By rewrite using indexes: ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q
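
To illustrate why a GROUP BY query can be answered from index data alone, here is a
minimal Python sketch. It is not the patch's code: the actual rule is described in the
GbToCompactSumIdxRewrite class comment, which is not reproduced in this message. The
sketch assumes a compact index that stores, per key value and per data file, the list of
row offsets for that key. The sample rows, the "file_0" bucket name and all function
names are hypothetical.

```python
from collections import defaultdict

# Hypothetical base table rows: (group-by key, byte offset of the row in its file).
base_rows = [
    ("1996-01-02", 0),
    ("1996-01-02", 128),
    ("1996-03-13", 256),
    ("1996-01-02", 384),
    ("1996-03-13", 512),
]


def group_by_count_base(rows):
    """COUNT(*) ... GROUP BY key, computed by scanning every base table row."""
    counts = defaultdict(int)
    for key, _offset in rows:
        counts[key] += 1
    return dict(counts)


def build_compact_index(rows, bucket="file_0"):
    """Assumed index layout: one entry per (key, file) holding that key's row offsets."""
    index = defaultdict(list)
    for key, offset in rows:
        index[(key, bucket)].append(offset)
    return dict(index)


def group_by_count_index(index):
    """Same result from the index only: sum the offset-list sizes per key."""
    counts = defaultdict(int)
    for (key, _bucket), offsets in index.items():
        counts[key] += len(offsets)
    return dict(counts)


if __name__ == "__main__":
    index = build_compact_index(base_rows)
    print(group_by_count_base(base_rows))
    print(group_by_count_index(index))
```

Under these assumptions the index has one entry per distinct key per file rather than one
per row, so the second function reads far less data when keys repeat often.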

Quick performance test results on a very small Hadoop cluster:

Two queries (chosen to demonstrate perf gains) were run on the lineitem table of the TPC-H
benchmark data.

Timings are in seconds; the data set size (1M, 1G etc.) is the TPC-H scale factor.
                 1M       1G      10G       30G
  q1_no_idx  24.161   76.790  506.005  1551.555
q1_with_idx  21.268   27.292   35.502    86.133
  q2_no_idx  73.660  130.587  764.619  2146.423
q2_with_idx  69.393   75.493   92.867   190.619
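
The speedup implied by these timings can be computed directly. The short Python sketch
below copies the numbers from the table and divides the no-index time by the with-index
time for each scale factor. The dictionary layout and function name are illustrative
only, and the row paired with q2_with_idx is taken as the q2 baseline.

```python
# Timings in seconds, copied from the table above.
scales = ["1M", "1G", "10G", "30G"]
timings = {
    "q1": {
        "no_idx": [24.161, 76.790, 506.005, 1551.555],
        "with_idx": [21.268, 27.292, 35.502, 86.133],
    },
    "q2": {
        "no_idx": [73.660, 130.587, 764.619, 2146.423],
        "with_idx": [69.393, 75.493, 92.867, 190.619],
    },
}


def speedups(query):
    """Return {scale: no-index time / with-index time} for one query."""
    runs = timings[query]
    return {
        scale: base / indexed
        for scale, base, indexed in zip(scales, runs["no_idx"], runs["with_idx"])
    }


if __name__ == "__main__":
    for query in timings:
        ratios = ", ".join(f"{s}: {r:.1f}x" for s, r in speedups(query).items())
        print(f"{query}: {ratios}")
```

At the 30G scale factor this works out to roughly 18x for q1 and roughly 11x for q2, while
at 1M the gain is small.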

Description of the Hadoop cluster used for the perf test above:
- 2 server class machines (each box: CentOS 5.x Linux, 5 SAS disks in RAID5, 16GB RAM)
- 2-node Hadoop cluster (0.20.2), un-tuned and un-optimized, data neither partitioned nor
clustered, Hive tables stored in row-store format, HDFS replication factor: 2
- Sun JDK 1.6 (server mode JVM, JVM_HEAP_SIZE: 4GB RAM)
- Queries on TPC-H data (lineitem table: 70% of TPC-H data size, e.g. TPC-H 30GB data: 21GB
lineitem, ~180 million tuples)

These changes are being maintained at

> Accelerate query execution using indexes
> ----------------------------------------
>                 Key: HIVE-1694
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>          Components: Indexing, Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Nikhil Deshpande
> The index building patch (HIVE-417) is checked into trunk; this JIRA issue tracks supporting
indexes in the Hive compiler & execution engine for SELECT queries.
> This is in ref. to John's comment at
> on creating separate JIRA issue for tracking index usage in optimizer & query execution.
> The aim of this effort is to use indexes to accelerate query execution (for certain classes
of queries). E.g.
> - Filters and range scans (already being worked on by He Yongqiang as part of HIVE-417?)
> - Joins (index based joins)
> - Group By, Order By and other misc cases
> The proposal is multi-step:
> 1. Building index based operators, compiler and execution engine changes
> 2. Optimizer enhancements (e.g. cost-based optimizer to compare and choose between index
scans, full table scans etc.)
> This JIRA initially focuses on the first step. It is expected to hold the information
about index based plans & operator implementations for the cases mentioned above.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
