hive-dev mailing list archives

From "John Sichi (JIRA)" <>
Subject [jira] Commented: (HIVE-1694) Accelerate GROUP BY execution using indexes
Date Mon, 28 Feb 2011 19:51:38 GMT


John Sichi commented on HIVE-1694:

I'd like to propose a fourth option instead: create a new handler type which stores both
the count and the offsets together, so that it can be used for both aggregation and filtering.
The index build can still be done with a single GROUP BY, but now with three aggregate expressions
in the SELECT list: collect_set(BLOCKOFFSETINSIDEFILE), COUNT(`l_shipdate`), and COUNT(*).
For a column known to be NOT NULL, COUNT(*) alone is good enough, but Hive doesn't currently
have that metadata. You could also use IDXPROPERTIES to allow for additional expressions
(SUM/MAX/MIN, complex expressions, etc.), making these start to look more like materialized
aggregate views.
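
To make the idea concrete, here is a rough in-memory model of such a combined index, written in Python rather than HiveQL so it can be run standalone. It is only a sketch under stated assumptions: the function names (build_aggregate_index, count_by_key, offsets_for) and the sample rows are hypothetical, block offsets are modeled as plain integers within a single file, and nothing here reflects actual handler code.

```python
from collections import defaultdict


def build_aggregate_index(rows):
    """Model of a single GROUP BY key with three aggregates.

    rows: iterable of (key, block_offset) pairs; key may be None (SQL NULL).
    Returns {key: (offsets, count_key, count_star)}.
    """
    offsets = defaultdict(set)
    count_key = defaultdict(int)
    count_star = defaultdict(int)
    for key, block_offset in rows:
        offsets[key].add(block_offset)  # collect_set(offset)
        count_star[key] += 1            # COUNT(*) counts every row
        if key is not None:
            count_key[key] += 1         # COUNT(key) skips NULLs
    return {k: (offsets[k], count_key[k], count_star[k]) for k in offsets}


def count_by_key(index):
    """Aggregation use: answer COUNT(key) per group from the index alone."""
    return {k: entry[1] for k, entry in index.items()}


def offsets_for(index, key):
    """Filtering use: block offsets to scan for an equality predicate on key."""
    entry = index.get(key)
    return sorted(entry[0]) if entry else []


# Hypothetical sample data: (l_shipdate, block offset) pairs.
sample_rows = [
    ("1998-09-01", 0),
    ("1998-09-01", 0),
    ("1998-09-01", 4096),
    ("1998-09-02", 4096),
    (None, 8192),
]

index = build_aggregate_index(sample_rows)
```

In this model the NULL group is the only place where the two counts differ (COUNT(key) is 0 while COUNT(*) is 1), which is why COUNT(*) alone would suffice for a column known to be NOT NULL.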

In HIVE-1803, they are working on factoring out some of the generic parts of the compact index
handler for reuse; we should depend on that for the aggregate index handler to avoid duplicating

> Accelerate GROUP BY execution using indexes
> -------------------------------------------
>                 Key: HIVE-1694
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>          Components: Indexing, Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Nikhil Deshpande
>            Assignee: Nikhil Deshpande
>         Attachments: HIVE-1694.1.patch.txt, HIVE-1694_2010-10-28.diff, demo_q1.hql, demo_q2.hql
> The index building patch (HIVE-417) is checked into trunk; this JIRA issue tracks supporting
> indexes in the Hive compiler & execution engine for SELECT queries.
> This is in ref. to John's comment at
> on creating separate JIRA issue for tracking index usage in optimizer & query execution.
> The aim of this effort is to use indexes to accelerate query execution (for certain classes
> of queries). E.g.
> - Filters and range scans (already being worked on by He Yongqiang as part of HIVE-417?)
> - Joins (index based joins)
> - Group By, Order By and other misc cases
> The proposal is multi-step:
> 1. Building index based operators, compiler and execution engine changes
> 2. Optimizer enhancements (e.g. cost-based optimizer to compare and choose between index
> scans, full table scans etc.)
> This JIRA initially focuses on the first step. This JIRA is expected to hold the information
> about index based plans & operator implementations for above mentioned cases.

