Mailing-List: contact issues-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Fri, 27 May 2016 09:44:12 +0000 (UTC)
From: "Gopal V (JIRA)" <jira@apache.org>
To: issues@hive.apache.org
Message-ID: <JIRA.12973291.1464321444000.314735.1464342252845@Atlassian.JIRA>
In-Reply-To: <JIRA.12973291.1464321444000@Atlassian.JIRA>
References: <JIRA.12973291.1464321444000@Atlassian.JIRA> <JIRA.12973291.1464321444310@arcas>
Subject: [jira] [Comment Edited] (HIVE-13872) Vectorization: Fix
 cross-product reduce sink serialization
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Fri, 27 May 2016 09:44:14 -0000


    [ https://issues.apache.org/jira/browse/HIVE-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303808#comment-15303808 ] 

Gopal V edited comment on HIVE-13872 at 5/27/16 9:43 AM:
---------------------------------------------------------

The patch clearly breaks TEXT vectorization, which relies on this column count to vectorize readers.

{code}
2016-05-27T05:13:14,391 INFO  [main]: physical.Vectorizer (:()) - Validating MapWork...
2016-05-27T05:13:14,391 INFO  [main]: physical.Vectorizer (:()) - Vectorizer path: hdfs://sandbox.hortonworks.com:8020/tmp/tpcds_dataset/200/200/item, vector map operator read type VECTOR_DESERIALIZE, input file format class name org.apache.hadoop.mapred.TextInputFormat, deserialize type LAZY_SIMPLE, aliases [item]
2016-05-27T05:13:14,391 INFO  [main]: physical.Vectorizer (:()) - Could not vectorize partition hdfs://sandbox.hortonworks.com:8020/tmp/tpcds_dataset/200/200/item (deserializer org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe)The partition column names 22 is greater than the number of table columns 2
{code}


was (Author: gopalv):
The patch clearly breaks TEXT vectorization, which relies on this column count to vectorize readers.

{code}
2016-05-27T05:13:14,391 INFO  [main]: physical.Vectorizer (:()) - Validating MapWork...
2016-05-27T05:13:14,391 INFO  [main]: physical.Vectorizer (:()) - Vectorizer path: hdfs://sandbox.l42scl.hortonworks.com:8020/tmp/tpcds_dataset/200/200/item, vector map operator read type VECTOR_DESERIALIZE, input file format class name org.apache.hadoop.mapred.TextInputFormat, deserialize type LAZY_SIMPLE, aliases [item]
2016-05-27T05:13:14,391 INFO  [main]: physical.Vectorizer (:()) - Could not vectorize partition hdfs://cn108-10.l42scl.hortonworks.com:8020/tmp/tpcds_dataset/200/200/item (deserializer org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe)The partition column names 22 is greater than the number of table columns 2
{code}

> Vectorization: Fix cross-product reduce sink serialization
> ----------------------------------------------------------
>
>                 Key: HIVE-13872
>                 URL: https://issues.apache.org/jira/browse/HIVE-13872
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>    Affects Versions: 2.1.0
>            Reporter: Gopal V
>         Attachments: HIVE-13872.WIP.patch
>
>
> With CBO disabled, TPC-DS Q13 is not simplified and produces a cross-product, which fails in the vectorized reduce sink:
> {code}
> Caused by: java.lang.RuntimeException: null STRING entry: batchIndex 0 projection column num 1
>         at org.apache.hadoop.hive.ql.exec.vector.VectorExtractRow.nullBytesReadError(VectorExtractRow.java:349)
>         at org.apache.hadoop.hive.ql.exec.vector.VectorExtractRow.extractRowColumn(VectorExtractRow.java:267)
>         at org.apache.hadoop.hive.ql.exec.vector.VectorExtractRow.extractRow(VectorExtractRow.java:343)
>         at org.apache.hadoop.hive.ql.exec.vector.VectorReduceSinkOperator.process(VectorReduceSinkOperator.java:103)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
>         at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
>         at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:762)
>         ... 18 more
> {code}
> Simplified query:
> {code}
> set hive.cbo.enable=false;
> -- explain
> select count(1)
>   from store_sales
>       ,customer_demographics
>  where (
>      (
>        customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk
>        and customer_demographics.cd_marital_status = 'M'
>      ) or
>      (
>        customer_demographics.cd_demo_sk = ss_cdemo_sk
>        and customer_demographics.cd_marital_status = 'U'
>      ))
> ;
> {code}
> {code}
>         Map 3 
>             Map Operator Tree:
>                 TableScan
>                   alias: customer_demographics
>                   Statistics: Num rows: 1920800 Data size: 717255532 Basic stats: COMPLETE Column stats: NONE
>                   Reduce Output Operator
>                     sort order: 
>                     Statistics: Num rows: 1920800 Data size: 717255532 Basic stats: COMPLETE Column stats: NONE
>                     value expressions: cd_demo_sk (type: int), cd_marital_status (type: string)
>             Execution mode: vectorized, llap
>             LLAP IO: all inputs
> {code}
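
For illustration only, here is a small Python sketch (not Hive code) of why the simplified query ends up as a cross-product when it is not rewritten. Both OR branches repeat the same equality on cd_demo_sk, so the predicate can be factored as (a AND b) OR (a AND c) = a AND (b OR c), which exposes a join key. This is presumably the kind of simplification the description expects from CBO. The sample rows and every function name below are hypothetical.

```python
from itertools import product

# Hypothetical miniature tables, invented for this sketch.
store_sales = [1, 2, 2, 3]  # ss_cdemo_sk values
customer_demographics = [(1, "M"), (2, "U"), (3, "S"), (4, "M")]  # (cd_demo_sk, cd_marital_status)


def original_predicate(ss_cdemo_sk, cd_demo_sk, status):
    # Mirrors the WHERE clause as written: the join condition is buried in an OR.
    return ((cd_demo_sk == ss_cdemo_sk and status == "M")
            or (cd_demo_sk == ss_cdemo_sk and status == "U"))


def factored_predicate(ss_cdemo_sk, cd_demo_sk, status):
    # Same logic with the common equality pulled out of both branches.
    return cd_demo_sk == ss_cdemo_sk and status in ("M", "U")


def cross_product_count(predicate):
    # Evaluates the predicate over every pair of rows, as a cross-product plan would.
    return sum(1 for ss, (cd, st) in product(store_sales, customer_demographics)
               if predicate(ss, cd, st))


def keyed_join_count():
    # Once the equality is exposed, rows can be matched by key instead of pairwise.
    matches_per_key = {}
    for cd, st in customer_demographics:
        if st in ("M", "U"):
            matches_per_key[cd] = matches_per_key.get(cd, 0) + 1
    return sum(matches_per_key.get(ss, 0) for ss in store_sales)


if __name__ == "__main__":
    print(cross_product_count(original_predicate),
          cross_product_count(factored_predicate),
          keyed_join_count())
```

All three counts agree on the sample rows. The keyed form needs a join key, whereas the plan above shows an empty sort order on the Reduce Output Operator, consistent with a keyless cross-product.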

