hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Remus Rusanu (JIRA)" <>
Subject [jira] [Commented] (HIVE-4676) Optimize COUNT(*) aggregate over vectorized ORC execution path
Date Fri, 07 Jun 2013 21:08:20 GMT


Remus Rusanu commented on HIVE-4676:

The count(*) aggregate is already optimized to just increment the count with the batch size.
We just need to make sure the iterator doe snot read extra columns.
> Optimize COUNT(*) aggregate over vectorized ORC execution path
> --------------------------------------------------------------
>                 Key: HIVE-4676
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>    Affects Versions: vectorization-branch
>            Reporter: Eric Hanson
> The COUNT\(*\) aggregate with the vectorized execution path over ORC should be optimized
because it is a very common case.
> Given a table factsqlengineam_vec_orc with about 25 columns and 218 million rows, this
> select count\(*\) from factsqlengineam_vec_orc;
> runs in 2 minutes 15 seconds, with HDFS Read 2,000,078,555
> and this query
> select count(mrowflag) from factsqlengineam_vec_orc;
> runs in 42 seconds, with HDFS Read 1,207,855
> Because the column mrowflag is non-null, both queries return the same result.
> We should optimize count\(\*\) so that it, say, chooses the most-compressed column from
the ORC file (or even a single random column) and counts those values, but logically counts
null values too so the meaning is the same as count\(*\). The vectorized iterator should not
have to load all columns, just one column minimum, and any columns being filtered in the WHERE
> For scalar count\(*\) aggregates (i.e. without group-by) we can simply tally up the total
number of remaining rows in each batch, without even looking at the data. Maybe we're already
doing that but we are reading more data than necessary now.
> The query 
> select count\(\*\) from factsqlengineam_vec_orc where mrowflag > 0;
> Takes only 41 seconds and also reads 1,207,855 bytes, so it appears that when there is
no WHERE clause, more data is read.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:

View raw message