hive-issues mailing list archives

From "Matt McCline (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-9983) Vectorizer doesn't vectorize (1) partitions with different schema anywhere (2) any MapWork with >1 table scans in MR
Date Thu, 26 Nov 2015 02:34:11 GMT

    [ https://issues.apache.org/jira/browse/HIVE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028048#comment-15028048 ]

Matt McCline commented on HIVE-9983:
------------------------------------

With HIVE-11981 (Schema Evolution), case (1) now vectorizes. Resolving case (2) as Won't Fix.

> Vectorizer doesn't vectorize (1) partitions with different schema anywhere (2) any MapWork with >1 table scans in MR
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-9983
>                 URL: https://issues.apache.org/jira/browse/HIVE-9983
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>            Reporter: Sergey Shelukhin
>            Assignee: Matt McCline
>
> For one test, the tables are created as follows:
> {noformat}
> CREATE TABLE orc_llap_part(
>     csmallint SMALLINT,
>     cint INT,
>     cbigint BIGINT,
>     cfloat FLOAT,
>     cdouble DOUBLE,
>     cstring1 STRING,
>     cstring2 STRING,
>     ctimestamp1 TIMESTAMP,
>     ctimestamp2 TIMESTAMP,
>     cboolean1 BOOLEAN,
>     cboolean2 BOOLEAN
> ) PARTITIONED BY (ctinyint TINYINT) STORED AS ORC;
> CREATE TABLE orc_llap_dim_part(
>     cbigint BIGINT
> ) PARTITIONED BY (ctinyint TINYINT) STORED AS ORC;
> INSERT OVERWRITE TABLE orc_llap_part PARTITION (ctinyint)
> SELECT csmallint, cint, cbigint, cfloat, cdouble, cstring1, cstring2, ctimestamp1, ctimestamp2, cboolean1, cboolean2, ctinyint FROM alltypesorc;
> INSERT OVERWRITE TABLE orc_llap_dim_part PARTITION (ctinyint)
> SELECT sum(cbigint) as cbigint, ctinyint FROM alltypesorc WHERE ctinyint > 10 AND ctinyint < 21 GROUP BY ctinyint;
> {noformat}
> The query is:
> {noformat}
> explain
>   SELECT oft.ctinyint, oft.cint FROM orc_llap_part oft
>   INNER JOIN orc_llap_dim_part od ON oft.ctinyint = od.ctinyint;
> {noformat}
> This results in a failure to vectorize in MR:
> {noformat}
> Could not vectorize partition pfile:/Users/sergey/git/hive3/itests/qtest/target/warehouse/orc_llap_dim_part/ctinyint=11. Its column names cbigint do not match the other column names csmallint,cint,cbigint,cfloat,cdouble,cstring1,cstring2,ctimestamp1,ctimestamp2,cboolean1,cboolean2
> {noformat}
> This is comparing schemas from different tables because the MapWork has 2 TableScan-s; in Tez this error will never happen, as a MapWork will not have 2 scans.
> In Tez (and in MR as well), the other case can happen, namely partitions of the same table having different schemas.
> The Tez case can be solved by making a super-schema that includes all variations and handling missing columns where necessary.
> The MR case may be harder to solve.
> Of note is that despite the schemas being different (and not one being a prefix of the other by coincidence or some such), the query passes if validation is commented out. Perhaps in some cases it can work?
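
The super-schema idea from the quoted description can be illustrated with a small sketch. This is not Hive's Vectorizer code and not the approach taken in HIVE-11981; it only shows, under assumptions, what "include all variations and handle missing columns" could mean. The names `build_super_schema` and `project_row`, and the representation of a schema as a list of (column name, type name) pairs, are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: a schema is an ordered list of
# (column name, type name) pairs, one schema per partition.
Schema = List[Tuple[str, str]]


def build_super_schema(partition_schemas: List[Schema]) -> Schema:
    """Union the columns of all partition schemas, keeping first-seen order.

    Raises ValueError if the same column name appears with two different
    types; this sketch does not attempt any type promotion.
    """
    merged: Dict[str, str] = {}
    for schema in partition_schemas:
        for name, type_name in schema:
            seen = merged.setdefault(name, type_name)
            if seen != type_name:
                raise ValueError(
                    f"column {name!r} has conflicting types: "
                    f"{seen} vs {type_name}"
                )
    return list(merged.items())


def project_row(row: Dict[str, object],
                super_schema: Schema) -> List[Optional[object]]:
    """Lay a partition's row out in super-schema order.

    Columns the partition does not have are filled with None (NULL).
    """
    return [row.get(name) for name, _ in super_schema]


# Two partitions of the same table whose schemas differ.
old_partition: Schema = [("cint", "int"), ("cbigint", "bigint")]
new_partition: Schema = [("cint", "int"), ("cbigint", "bigint"),
                         ("cstring1", "string")]

super_schema = build_super_schema([old_partition, new_partition])
row_from_old = project_row({"cint": 1, "cbigint": 2}, super_schema)
```

In this sketch a row read from the older partition gets a NULL for `cstring1`, so every partition presents the same column layout to downstream operators. It covers only partitions of a single table; it does not address the MR case of two table scans in one MapWork.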



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
