hive-dev mailing list archives

From "Yin Huai (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
Date Thu, 19 Sep 2013 12:52:54 GMT

    [ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771847#comment-13771847 ]

Yin Huai commented on HIVE-4113:
--------------------------------

READ_ALL_COLUMNS and READ_ALL_COLUMNS_DEFAULT were created mainly for HCat, because I think
it is a burden on users if they have to be aware of ColumnProjectionUtils and use it
every time. So, through HCat, if users do not use ColumnProjectionUtils to set the needed columns,
we will read all columns. If we set READ_ALL_COLUMNS_DEFAULT=false, no columns will be read
when a user does not use ColumnProjectionUtils.

In Hive, if we get rid of the column pruning flag, the list of neededColumnIDs in the TableScan
operator (TS) will never be null. Thus, in Hive, we will always set READ_ALL_COLUMNS to false
(the .2 patch has an issue with this; I will fix it later).

In summary, in Hive, neededColumnIDs in TS is the only way to tell the underlying RecordReader
what to read. If neededColumnIDs is an empty list, we know that no columns are needed. Otherwise,
we read the columns specified in neededColumnIDs (if there is a select * in a sub-query, neededColumnIDs
should be populated to include all columns).
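
To make the rule above concrete, here is a minimal sketch in plain Java. It is not the code from any HIVE-4113 patch: the class NeededColumnsSketch and the method columnsToRead are hypothetical names used only for illustration, and the sketch assumes column IDs are zero-based positions in the table schema.

```java
import java.util.List;

// Hypothetical illustration only; not part of Hive or of any HIVE-4113 patch.
final class NeededColumnsSketch {

    private NeededColumnsSketch() {
    }

    /**
     * Builds a read mask from neededColumnIDs.
     * An empty list means no column is needed (e.g. select count(1)).
     * A non-empty list means exactly the listed columns are read.
     */
    static boolean[] columnsToRead(List<Integer> neededColumnIDs, int totalColumns) {
        if (neededColumnIDs == null) {
            // The proposal assumes the list is never null on the Hive side.
            throw new IllegalArgumentException("neededColumnIDs must not be null");
        }
        boolean[] read = new boolean[totalColumns]; // all false: read nothing
        for (int id : neededColumnIDs) {
            if (id >= 0 && id < totalColumns) {
                read[id] = true;
            }
        }
        return read;
    }
}
```

With this sketch, an empty list yields a mask that is false everywhere, while a select * over a three-column table would pass the list 0, 1, 2 and get a mask that is true everywhere.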

In HCat, a user of the MapReduce interface has two ways to say which
columns are needed. 1) The user does nothing. In this case, we will read all columns. 2)
The user calls utility functions in ColumnProjectionUtils (e.g. setReadColumnIDs) to specify
the needed columns. In this case, READ_ALL_COLUMNS will be set to false and we only read the columns
specified in READ_COLUMN_IDS_CONF_STR.
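
The two HCat cases can be sketched as follows. This is a self-contained model, not the real ColumnProjectionUtils: a java.util.Map stands in for the Hadoop Configuration, the key strings are made-up placeholders (the actual property names should be taken from ColumnProjectionUtils itself), and the method bodies are my assumption of the proposed behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of the proposed behaviour; not the real ColumnProjectionUtils.
final class ReadColumnsConfSketch {

    // Placeholder key strings, assumed for illustration only.
    static final String READ_ALL_COLUMNS = "sketch.read.all.columns";
    static final String READ_COLUMN_IDS_CONF_STR = "sketch.read.column.ids";
    // Default when the user never touches the projection settings.
    static final boolean READ_ALL_COLUMNS_DEFAULT = true;

    private ReadColumnsConfSketch() {
    }

    /** Case 2: the user names the needed columns, which turns off read-all. */
    static void setReadColumnIDs(Map<String, String> conf, List<Integer> ids) {
        StringBuilder sb = new StringBuilder();
        for (int id : ids) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(id);
        }
        conf.put(READ_COLUMN_IDS_CONF_STR, sb.toString());
        conf.put(READ_ALL_COLUMNS, "false");
    }

    /** Case 1: nothing was set, so the default (read everything) applies. */
    static boolean isReadAllColumns(Map<String, String> conf) {
        String value = conf.get(READ_ALL_COLUMNS);
        return value == null ? READ_ALL_COLUMNS_DEFAULT : Boolean.parseBoolean(value);
    }

    /** Returns the configured column IDs, or an empty list if none were set. */
    static List<Integer> getReadColumnIDs(Map<String, String> conf) {
        List<Integer> ids = new ArrayList<>();
        String value = conf.get(READ_COLUMN_IDS_CONF_STR);
        if (value == null || value.isEmpty()) {
            return ids;
        }
        for (String part : value.split(",")) {
            ids.add(Integer.parseInt(part.trim()));
        }
        return ids;
    }
}
```

A reader would first check isReadAllColumns and fall back to getReadColumnIDs only when it returns false. Note that calling setReadColumnIDs with an empty list leaves read-all off and no IDs configured, which is the "read no columns" case.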

I hope what I am proposing makes sense. Any suggestions are welcome :)
                
> Optimize select count(1) with RCFile and Orc
> --------------------------------------------
>
>                 Key: HIVE-4113
>                 URL: https://issues.apache.org/jira/browse/HIVE-4113
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats
>            Reporter: Gopal V
>            Assignee: Yin Huai
>             Fix For: 0.12.0
>
>         Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, HIVE-4113.patch, HIVE-4113.patch
>
>
> select count(1) loads up every column & every row when used with RCFile.
> "select count(1) from store_sales_10_rc" gives
> {code}
> Job 0: Map: 5  Reduce: 1   Cumulative CPU: 31.73 sec   HDFS Read: 234914410 HDFS Write: 8 SUCCESS
> {code}
> Whereas "select count(ss_sold_date_sk) from store_sales_10_rc;" reads far less
> {code}
> Job 0: Map: 5  Reduce: 1   Cumulative CPU: 29.75 sec   HDFS Read: 28145994 HDFS Write: 8 SUCCESS
> {code}
> Which is about 12% of the data size read by the COUNT(1).
> This was tracked down to the following code in RCFile.java
> {code}
>       } else {
>         // TODO: if no column name is specified e.g, in select count(1) from tt;
>         // skip all columns, this should be distinguished from the case:
>         // select * from tt;
>         for (int i = 0; i < skippedColIDs.length; i++) {
>           skippedColIDs[i] = false;
>         }
> {code}
