hive-user mailing list archives

From Andrey Zinovyev <andrey.zinov...@gmail.com>
Subject Re: hive 3.1 mapjoin with complex predicate produce incorrect results
Date Mon, 24 Dec 2018 09:37:58 GMT
Yep, "set hive.vectorized.reuse.scratch.columns=false;" fixes the problem.
And there is definitely something wrong with 'if'; without it everything
works fine.
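
For anyone reproducing this, here is a minimal sketch of the session. The
query is an assumption reconstructed from the plan below (tables xs and
dict, left outer join keyed on the 'if' expression); the select list is a
guess, and the join key is written as it appears in the plan, so the
original predicate may have been spelled differently.

```sql
-- Workaround confirmed in this thread: stop the vectorizer from
-- reusing scratch columns across expressions.
set hive.vectorized.reuse.scratch.columns=false;

-- Hypothetical query shape inferred from the plan; not the exact
-- original query text.
explain vectorization detail
select xs.key, dict.key, dict.b
from xs
left outer join dict
  on if(xs.key is null, 44, xs.key) = dict.key;
```

Running the same statement with the setting left at its default and
comparing the two plans (and the two result sets) should show whether the
difference comes from scratch column reuse.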

Output of "explain vectorization detail":

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| PLAN VECTORIZATION:                                |
|   enabled: true                                    |
|   enabledConditionsMet: [hive.vectorized.execution.enabled IS true] |
|                                                    |
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Tez                                            |
|       DagId: hive_20181224123402_3536cf16-bb5b-496c-b196-417d6dff4be0:11867 |
|       Edges:                                       |
|         Map 1 <- Map 2 (BROADCAST_EDGE)            |
|       DagName: hive_20181224123402_3536cf16-bb5b-496c-b196-417d6dff4be0:11867 |
|       Vertices:                                    |
|         Map 1                                      |
|             Map Operator Tree:                     |
|                 TableScan                          |
|                   alias: xs                        |
|                   Statistics: Num rows: 5 Data size: 20 Basic stats: COMPLETE Column stats: COMPLETE |
|                   TableScan Vectorization:         |
|                       native: true                 |
|                       vectorizationSchemaColumns: [0:key:int, 1:a:int, 2:ROW__ID:struct<writeid:bigint,bucketid:int,rowid:bigint>] |
|                   Select Operator                  |
|                     expressions: key (type: int)   |
|                     outputColumnNames: _col0       |
|                     Select Vectorization:          |
|                         className: VectorSelectOperator |
|                         native: true               |
|                         projectedOutputColumnNums: [0] |
|                     Statistics: Num rows: 5 Data size: 20 Basic stats: COMPLETE Column stats: COMPLETE |
|                     Map Join Operator              |
|                       condition map:               |
|                            Left Outer Join 0 to 1  |
|                       keys:                        |
|                         0 if(_col0 is null, 44, _col0) (type: int) |
|                         1 _col0 (type: int)        |
|                       Map Join Vectorization:      |
|                           bigTableKeyColumnNums: [4] |
|                           bigTableKeyExpressions: IfExprLongScalarLongColumn(col 3:boolean, val 44, col 0:int)(children: IsNull(col 0:int) -> 3:boolean) -> 4:int |
|                           bigTableOuterKeyMapping: 4 -> 5 |
|                           bigTableRetainedColumnNums: [0, 5] |
|                           bigTableValueColumnNums: [0] |
|                           className: VectorMapJoinOuterLongOperator |
|                           native: true             |
|                           nativeConditionsMet: hive.mapjoin.optimized.hashtable IS true, hive.vectorized.execution.mapjoin.native.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true, One MapJoin Condition IS true, No nullsafe IS true, Small table vectorizes IS true, Outer Join has keys IS true, Fast Hash Table and No Hybrid Hash Join IS true |
|                           projectedOutputColumnNums: [0, 5, 6] |
|                           smallTableMapping: [6]   |
|                       outputColumnNames: _col0, _col1, _col2 |
|                       input vertices:              |
|                         1 Map 2                    |
|                       Statistics: Num rows: 5 Data size: 52 Basic stats: COMPLETE Column stats: COMPLETE |
|                       File Output Operator         |
|                         compressed: false          |
|                         File Sink Vectorization:   |
|                             className: VectorFileSinkOperator |
|                             native: false          |
|                         Statistics: Num rows: 5 Data size: 52 Basic stats: COMPLETE Column stats: COMPLETE |
|                         table:                     |
|                             input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                             output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                             serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|             Execution mode: vectorized             |
|             Map Vectorization:                     |
|                 enabled: true                      |
|                 enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true |
|                 inputFormatFeatureSupport: [DECIMAL_64] |
|                 featureSupportInUse: [DECIMAL_64]  |
|                 inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat |
|                 allNative: false                   |
|                 usesVectorUDFAdaptor: false        |
|                 vectorized: true                   |
|                 rowBatchContext:                   |
|                     dataColumnCount: 2             |
|                     includeColumns: [0]            |
|                     dataColumns: key:int, a:int    |
|                     partitionColumnCount: 0        |
|                     scratchColumnTypeNames: [bigint, bigint, bigint, bigint] |
|         Map 2                                      |
|             Map Operator Tree:                     |
|                 TableScan                          |
|                   alias: dict                      |
|                   Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE |
|                   TableScan Vectorization:         |
|                       native: true                 |
|                       vectorizationSchemaColumns: [0:key:int, 1:b:int, 2:ROW__ID:struct<writeid:bigint,bucketid:int,rowid:bigint>] |
|                   Select Operator                  |
|                     expressions: key (type: int), b (type: int) |
|                     outputColumnNames: _col0, _col1 |
|                     Select Vectorization:          |
|                         className: VectorSelectOperator |
|                         native: true               |
|                         projectedOutputColumnNums: [0, 1] |
|                     Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE |
|                     Reduce Output Operator         |
|                       key expressions: _col0 (type: int) |
|                       sort order: +                |
|                       Map-reduce partition columns: _col0 (type: int) |
|                       Reduce Sink Vectorization:   |
|                           className: VectorReduceSinkLongOperator |
|                           keyColumnNums: [0]       |
|                           native: true             |
|                           nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true, No PTF TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true |
|                           valueColumnNums: [1]     |
|                       Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE |
|                       value expressions: _col1 (type: int) |
|             Execution mode: vectorized             |
|             Map Vectorization:                     |
|                 enabled: true                      |
|                 enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true |
|                 inputFormatFeatureSupport: [DECIMAL_64] |
|                 featureSupportInUse: [DECIMAL_64]  |
|                 inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat |
|                 allNative: true                    |
|                 usesVectorUDFAdaptor: false        |
|                 vectorized: true                   |
|                 rowBatchContext:                   |
|                     dataColumnCount: 2             |
|                     includeColumns: [0, 1]         |
|                     dataColumns: key:int, b:int    |
|                     partitionColumnCount: 0        |
|                     scratchColumnTypeNames: []     |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

On Sun, Dec 23, 2018 at 4:11 AM Gopal Vijayaraghavan <gopalv@apache.org>
wrote:

> Hi,
>
> > Subject: Re: hive 3.1 mapjoin with complex predicate produce incorrect results
> ...
> > |                         0 if(_col0 is null, 44, _col0) (type: int) |
> > |                         1 _col0 (type: int)        |
>
> That rewrite is pretty neat, but I feel like the IF expression nesting is
> what is broken here.
>
> Can you run the same query with "set
> hive.vectorized.reuse.scratch.columns=false;" and see if this is a join
> expression column reuse problem.
>
> If that does work, can you send out a
>
> explain vectorization detail <query>;
>
> I'll eventually get back to my dev env in a week, but this looks like a
> low-level exec issue right now.
>
> Cheers,
> Gopal
>
>
>

-- 
Best regards,
Andrey Zinovyev
