hive-dev mailing list archives

From "John Sichi (JIRA)" <>
Subject [jira] [Commented] (HIVE-1694) Accelerate GROUP BY execution using indexes
Date Mon, 23 May 2011 18:14:48 GMT


John Sichi commented on HIVE-1694:

I collected comments from last week's review meeting below.

* The rewrite needs to verify that the index partitions are available (that is, that they
match the referenced table partitions).  You can take a look at the way the Harvey Mudd team
handles this, and maybe reuse their code.  Since the referenced partitions are only known
after pruning, this implies that predicate pushdown and partition pruning need to happen
BEFORE the rewrite is applied (currently the rewrite happens before them).

* Isn't it a bug that the GROUP BY is removed in some cases?  The index may store multiple
rows for the same base table key (since FILENAME is part of the index table key), so it seems
like a GROUP BY should always be required for removing those duplicates.

* Where is _countall used instead of _countkey?  Also, what happens if the index is compound
(multiple columns in its key)?

* Add a test case for a query in which a table scan is reused in a directed acyclic graph,
e.g. a UNION where one branch of the union does a rewritable GROUP BY on the table and the
other branch just reads the table directly.  We want to make sure that in this case, the rewrite's
replacement of the base table in one branch does not corrupt the other branch in any way.
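
To illustrate the duplicate-row concern in the second point above, here is a minimal
sketch in Python (not Hive code). The index row layout, the sample data, and the function
names are all assumptions made up for illustration; they are not the actual index table
schema or anything from the patch. It only shows why per-key counts read from an index keyed
on (key, FILENAME) would need to be re-aggregated.

```python
from collections import defaultdict

# Hypothetical index rows: (key, filename, count of base-table rows
# with that key in that file). This layout is an assumption for
# illustration, not the real Hive index table schema.
index_rows = [
    ("a", "file1", 2),
    ("a", "file2", 3),
    ("b", "file1", 1),
]


def counts_without_group_by(rows):
    """Read counts straight off the index: one output row per index row."""
    return [(key, count) for key, _filename, count in rows]


def counts_with_group_by(rows):
    """Re-aggregate per key, as a GROUP BY over the index table would."""
    totals = defaultdict(int)
    for key, _filename, count in rows:
        totals[key] += count
    return dict(totals)


naive = counts_without_group_by(index_rows)
grouped = counts_with_group_by(index_rows)
```

In this sketch, `naive` contains two separate rows for key "a" (one per file), while
`grouped` holds a single total per key, which is what a GROUP BY on the base table would
be expected to return.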

After these have been addressed (along with the existing review board comments) and you've
had a chance to rebase the patch, we'll do another pass.

Thanks again!

> Accelerate GROUP BY execution using indexes
> -------------------------------------------
>                 Key: HIVE-1694
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>          Components: Indexing, Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Nikhil Deshpande
>            Assignee: Prajakta Kalmegh
>         Attachments: HIVE-1694.1.patch.txt, HIVE-1694.2.patch.txt, HIVE-1694.3.patch.txt,
>                      HIVE-1694_2010-10-28.diff, demo_q1.hql, demo_q2.hql
> The index building patch (HIVE-417) is checked into trunk; this JIRA issue tracks supporting
> indexes in the Hive compiler & execution engine for SELECT queries.
> This is in ref. to John's comment at
> on creating separate JIRA issue for tracking index usage in optimizer & query execution.
> The aim of this effort is to use indexes to accelerate query execution (for certain classes
> of queries), e.g.:
> - Filters and range scans (already being worked on by He Yongqiang as part of HIVE-417?)
> - Joins (index based joins)
> - Group By, Order By and other misc cases
> The proposal is multi-step:
> 1. Building index based operators, compiler and execution engine changes
> 2. Optimizer enhancements (e.g. cost-based optimizer to compare and choose between index
> scans, full table scans etc.)
> This JIRA initially focuses on the first step. This JIRA is expected to hold the information
> about index-based plans & operator implementations for the above-mentioned cases.

This message is automatically generated by JIRA.