hive-dev mailing list archives

From "Eric Hanson (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-4676) Optimize COUNT(*) aggregate over vectorized ORC execution path
Date Thu, 06 Jun 2013 16:29:21 GMT
Eric Hanson created HIVE-4676:
---------------------------------

             Summary: Optimize COUNT(*) aggregate over vectorized ORC execution path
                 Key: HIVE-4676
                 URL: https://issues.apache.org/jira/browse/HIVE-4676
             Project: Hive
          Issue Type: Sub-task
          Components: Query Processor
    Affects Versions: vectorization-branch
            Reporter: Eric Hanson


The COUNT(*) aggregate with the vectorized execution path over ORC should be optimized because
it is a very common case.

Given a table factsqlengineam_vec_orc with about 25 columns and 218 million rows, this query

select count(*) from factsqlengineam_vec_orc;

runs in 2 minutes 15 seconds,

and this query

select count(mrowflag) from factsqlengineam_vec_orc;

runs in 42 seconds, roughly 3.2 times faster.

Because the column mrowflag contains no NULL values, both queries return the same result.

We should optimize count(*) so that it reads only a single column from the ORC file, say the
most-compressed column (or even an arbitrary one), and counts its entries, nulls included, so
that the result still has count(*) semantics. The vectorized iterator should not have to load
all columns: at minimum one column, plus any columns referenced by filters in the WHERE clause.
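
The column-selection idea above can be sketched roughly as follows. This is only an illustration in Python, not Hive's implementation: the function names, the per-column compressed sizes, the column names other than mrowflag, and the in-memory column representation are all hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set


def columns_to_read(compressed_sizes: Dict[str, int],
                    filter_columns: Iterable[str] = ()) -> Set[str]:
    """Pick the columns a count(*) scan needs (hypothetical helper).

    Columns referenced in the WHERE clause must be read anyway. If there
    are none, fall back to the single column with the smallest compressed
    size, used only as a cheap carrier of the row count.
    """
    needed = set(filter_columns)
    if not needed:
        needed.add(min(compressed_sizes, key=compressed_sizes.get))
    return needed


def count_star_from_column(values: List[Optional[int]]) -> int:
    """count(*) semantics: every entry counts, NULL (None) included."""
    return len(values)


def count_column(values: List[Optional[int]]) -> int:
    """count(col) semantics: NULL (None) entries are skipped."""
    return sum(1 for v in values if v is not None)


# Made-up sizes in bytes; mrowflag is assumed to compress best.
sizes = {"mrowflag": 1_000, "eventtime": 900_000, "payload": 5_000_000}
print(columns_to_read(sizes))                 # only the cheapest column
print(columns_to_read(sizes, ["eventtime"]))  # only the filter column

# A column with one NULL: the two counting rules now disagree.
column = [1, None, 0, 1]
print(count_star_from_column(column), count_column(column))
```

The two counting functions differ only when the chosen column contains nulls, which is why counting the entries of a single column "logically including nulls" preserves the meaning of count(*).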

For scalar count(*) aggregates (i.e. without group-by) we can simply tally up the number of
rows remaining in each batch after filtering, without even looking at the data. Maybe we are
already doing that, but something else is taking up the extra time.
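
A minimal sketch of that scalar tally, assuming each batch exposes a `size` field holding the number of rows that survived filtering. The `Batch` class below is a hypothetical stand-in written for illustration, not Hive's own batch class.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Batch:
    """Hypothetical stand-in for a vectorized row batch."""
    size: int  # rows remaining after filtering
    columns: List[list] = field(default_factory=list)  # never touched below


def scalar_count_star(batches: Iterable[Batch]) -> int:
    """Sum the surviving row counts; no column data is inspected."""
    return sum(batch.size for batch in batches)


batches = [Batch(size=1024), Batch(size=1024), Batch(size=317)]
print(scalar_count_star(batches))
```

Because only `size` is read, the cost of the aggregate itself is independent of how many columns the table has; any remaining cost would come from producing the batches.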



