hive-dev mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-9983) Vectorizer doesn't vectorize (1) partitions with different schema (2) any MapWork with >1 table scans in MR
Date Mon, 16 Mar 2015 22:52:38 GMT
Sergey Shelukhin created HIVE-9983:
--------------------------------------

             Summary: Vectorizer doesn't vectorize (1) partitions with different schema (2) any MapWork with >1 table scans in MR
                 Key: HIVE-9983
                 URL: https://issues.apache.org/jira/browse/HIVE-9983
             Project: Hive
          Issue Type: Bug
            Reporter: Sergey Shelukhin
            Assignee: Matt McCline


For one test, the tables are created as follows:
{noformat}

CREATE TABLE orc_llap_part(
    csmallint SMALLINT,
    cint INT,
    cbigint BIGINT,
    cfloat FLOAT,
    cdouble DOUBLE,
    cstring1 STRING,
    cstring2 STRING,
    ctimestamp1 TIMESTAMP,
    ctimestamp2 TIMESTAMP,
    cboolean1 BOOLEAN,
    cboolean2 BOOLEAN
) PARTITIONED BY (ctinyint TINYINT) STORED AS ORC;

CREATE TABLE orc_llap_dim_part(
    cbigint BIGINT
) PARTITIONED BY (ctinyint TINYINT) STORED AS ORC;


INSERT OVERWRITE TABLE orc_llap_part PARTITION (ctinyint)
SELECT csmallint, cint, cbigint, cfloat, cdouble, cstring1, cstring2,
       ctimestamp1, ctimestamp2, cboolean1, cboolean2, ctinyint
FROM alltypesorc;

INSERT OVERWRITE TABLE orc_llap_dim_part PARTITION (ctinyint)
SELECT sum(cbigint) AS cbigint, ctinyint
FROM alltypesorc
WHERE ctinyint > 10 AND ctinyint < 21
GROUP BY ctinyint;
{noformat}

The query is:
{noformat}
explain
  SELECT oft.ctinyint, oft.cint FROM orc_llap_part oft
  INNER JOIN orc_llap_dim_part od ON oft.ctinyint = od.ctinyint;
{noformat}

This results in a failure to vectorize in MR:
{noformat}
Could not vectorize partition pfile:/Users/sergey/git/hive3/itests/qtest/target/warehouse/orc_llap_dim_part/ctinyint=11.
 Its column names cbigint do not match the other column names csmallint,cint,cbigint,cfloat,cdouble,cstring1,cstring2,ctimestamp1,ctimestamp2,cboolean1,cboolean2
{noformat}

The validation is comparing schemas from different tables because the MapWork has two TableScans. In Tez
this error will never happen, as a MapWork there will not have two scans.
The other case, partitions of the same table having different schemas, can happen in Tez
as well as in MR.
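
To make the failure mode concrete, here is a rough Python model of the check as described above. It is not Hive's Vectorizer code: the function names, the tuple layout, and the shortened partition paths are all made up for illustration, and the second function is only one conceivable way to avoid the cross-table comparison.

```python
from collections import defaultdict

# Hypothetical, simplified model: a partition is (table, path, column_names).
FACT_COLS = ["csmallint", "cint", "cbigint", "cfloat", "cdouble", "cstring1",
             "cstring2", "ctimestamp1", "ctimestamp2", "cboolean1", "cboolean2"]
DIM_COLS = ["cbigint"]

# Shortened, illustrative paths; only the table names and columns come from the report.
partitions = [
    ("orc_llap_part", "orc_llap_part/ctinyint=11", FACT_COLS),
    ("orc_llap_dim_part", "orc_llap_dim_part/ctinyint=11", DIM_COLS),
]


def first_mismatch(parts):
    """Compare every partition's columns with the first partition's columns.

    Returns (path, columns, expected) for the first difference, or None.
    """
    if not parts:
        return None
    expected = parts[0][2]
    for _table, path, columns in parts[1:]:
        if columns != expected:
            return (path, columns, expected)
    return None


def first_mismatch_per_table(parts):
    """Same check, but only among partitions that belong to the same table."""
    by_table = defaultdict(list)
    for part in parts:
        by_table[part[0]].append(part)
    for table_parts in by_table.values():
        mismatch = first_mismatch(table_parts)
        if mismatch is not None:
            return mismatch
    return None
```

With the two partitions above, `first_mismatch(partitions)` reports the dimension partition as not matching the fact table's columns, which mirrors the error message, while `first_mismatch_per_table(partitions)` finds nothing to complain about. Grouping by table does not help with the second case (one table, differing partition schemas).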

The Tez case can be solved by building a super-schema that includes all variations and handling missing
columns where necessary.
The MR case may be harder to solve.
Of note: even though the schemas differ (and one is not a prefix of the other by coincidence
or some such), the query passes if the validation is commented out. Perhaps in some cases it can work?
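
The super-schema idea for the Tez case could look roughly like the Python sketch below. It is only an illustration of the concept, not a proposed patch: the helper names are hypothetical, schemas are modeled as lists of (name, type) pairs, missing columns are filled with None, and rejecting conflicting types is an assumption of this sketch rather than something decided in this issue.

```python
def build_super_schema(partition_schemas):
    """Union of all columns, in first-seen order.

    Each schema is a list of (column_name, type_name) pairs. A column that
    appears with two different types is rejected (an assumption of this sketch).
    """
    merged = {}
    for schema in partition_schemas:
        for name, col_type in schema:
            known = merged.setdefault(name, col_type)
            if known != col_type:
                raise ValueError(
                    "column %s has conflicting types %s and %s"
                    % (name, known, col_type))
    return list(merged.items())


def project_row(row, partition_schema, super_schema):
    """Lay out a partition's row in super-schema order, None for missing columns."""
    values = dict(zip((name for name, _ in partition_schema), row))
    return [values.get(name) for name, _ in super_schema]


# Two partitions of the same (hypothetical) table, the newer one with an extra column.
old_schema = [("cint", "int"), ("cbigint", "bigint")]
new_schema = [("cint", "int"), ("cbigint", "bigint"), ("cstring1", "string")]

super_schema = build_super_schema([old_schema, new_schema])
old_row = project_row([1, 2], old_schema, super_schema)
new_row = project_row([3, 4, "x"], new_schema, super_schema)
```

Here `super_schema` has three columns, and `old_row` carries None in the `cstring1` position, so rows from both partitions share one layout.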




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
