hive-dev mailing list archives

From "Remus Rusanu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-5817) column name to index mapping in VectorizationContext is broken
Date Mon, 25 Nov 2013 22:04:35 GMT

    [ https://issues.apache.org/jira/browse/HIVE-5817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13831968#comment-13831968 ]

Remus Rusanu commented on HIVE-5817:
------------------------------------

My patch .4 addresses the issue in the following manner:
 
 - Vector operators can implement the optional interface VectorizationContextRegion. If they do,
they must provide a new vectorization context to be used by their child operators. In my patch only
VectorMapJoinOperator does so.
 - The vectorizer walks up the stack of parent nodes to locate the first one (last one?) that
created a vectorization context, and that is the context used to vectorize the
current node. At the root of the stack there is a table scan, which always creates a vectorization
context.
 - I made VectorMapJoinOperator build the output VectorizedRowBatch using the VectorizedRowBatchCtx
class, the same as the ORC and RC scanners do. This is more consistent and removes the need for the
VectorizedRowBatch.buildBatch method (which was used only by VMJ).
 - I added a simplified init to VectorizedRowBatchCtx, to be used by VMJ (or any other operator
we decide on).
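
The context lookup in the second bullet can be sketched roughly as below. This is an illustration only, not code from the patch: the Ctx and Op classes, the contextFor method, and the operator labels are hypothetical stand-ins, and the sketch assumes the intended parent is the nearest one that created a context.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class ContextLookupSketch {
    /** Hypothetical stand-in for a vectorization context: column name -> batch column index. */
    static final class Ctx {
        final String owner;
        final Map<String, Integer> columnIndex = new HashMap<>();
        Ctx(String owner) { this.owner = owner; }
    }

    /** Hypothetical operator node; regionCtx is non-null only if the operator created a context. */
    static final class Op {
        final String name;
        final Ctx regionCtx;
        Op(String name, Ctx regionCtx) { this.name = name; this.regionCtx = regionCtx; }
    }

    /** Walk from the nearest parent toward the root and return the first context found. */
    static Ctx contextFor(Deque<Op> parentsNearestFirst) {
        for (Op parent : parentsNearestFirst) {
            if (parent.regionCtx != null) {
                return parent.regionCtx;
            }
        }
        // Assumption: a table scan at the root always supplies a context, so this is unreachable.
        throw new IllegalStateException("no operator in the stack created a context");
    }

    public static void main(String[] args) {
        Deque<Op> stack = new ArrayDeque<>();
        stack.push(new Op("TS", new Ctx("TS")));   // root table scan, always has a context
        stack.push(new Op("FIL", null));           // ordinary operator, no context of its own
        stack.push(new Op("VMJ", new Ctx("VMJ"))); // map join starts a new context region
        stack.push(new Op("SEL", null));           // nearest parent of the node being vectorized
        System.out.println(contextFor(stack).owner); // prints VMJ
    }
}
```

With this shape, operators below a map join resolve column names against the join's own context rather than against the table scan's.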

I have not enabled 'Submit Patch' yet because more code can be removed (the mapper scratch
for vector type map), code that was used only by VMJ to enable it to build the output batch.
Using VectorizedRowBatchCtx makes all that code obsolete.

I tested the repro query and it passes fine, producing 100 rows (I assume they're the right ones...).
I will do some more testing.

> column name to index mapping in VectorizationContext is broken
> --------------------------------------------------------------
>
>                 Key: HIVE-5817
>                 URL: https://issues.apache.org/jira/browse/HIVE-5817
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>            Reporter: Sergey Shelukhin
>            Assignee: Remus Rusanu
>            Priority: Critical
>         Attachments: HIVE-5817-uniquecols.broken.patch, HIVE-5817.00-broken.patch, HIVE-5817.4.patch
>
>
> Columns coming from different operators may have the same internal names ("_colNN").
> There exists a query in the form {{select b.cb, a.ca from a JOIN b ON ... JOIN x ON ...;}}
> (distilled from a more complex query), which runs ok w/o vectorization. With vectorization,
> it will run ok for most ca, but for some ca it will fail (or can probably return incorrect
> results). That is because when building column-to-VRG-index map in VectorizationContext, internal
> column name for ca that the first map join operator adds to the mapping may be the same as
> internal name for cb that the 2nd one tries to add. 2nd VMJ doesn't add it (see code in ctor),
> and when it's time for it to output stuff, it retrieves wrong index from the map by name,
> and then wrong vector from VRG.
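
The collision described above can be reproduced in miniature. This is a hedged sketch, not Hive code: the class and method names are hypothetical, the internal name "_col1" and the indices 3 and 7 are made up, and putIfAbsent is only assumed to approximate the "doesn't add it" behaviour in the constructor.

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnNameCollisionSketch {
    /** One name-to-index map shared by both joins: the second registration is silently dropped. */
    static int sharedMapLookup() {
        Map<String, Integer> shared = new HashMap<>();
        shared.putIfAbsent("_col1", 3); // first map join registers ca at (made-up) index 3
        shared.putIfAbsent("_col1", 7); // second map join tries to register cb at index 7; ignored
        return shared.get("_col1");     // second join now reads ca's index instead of cb's
    }

    /** One map per join (one context per region): each join resolves its own columns. */
    static int perRegionLookup() {
        Map<String, Integer> firstJoin = new HashMap<>();
        firstJoin.put("_col1", 3);
        Map<String, Integer> secondJoin = new HashMap<>();
        secondJoin.put("_col1", 7);
        return secondJoin.get("_col1");
    }

    public static void main(String[] args) {
        System.out.println(sharedMapLookup()); // prints 3 (wrong column for the second join)
        System.out.println(perRegionLookup()); // prints 7
    }
}
```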



--
This message was sent by Atlassian JIRA
(v6.1#6144)
