hive-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-8474) Vectorized reads of transactional tables fail when not all columns are selected
Date Fri, 17 Oct 2014 22:59:35 GMT

     [ https://issues.apache.org/jira/browse/HIVE-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated HIVE-8474:
-----------------------------
    Attachment: HIVE-8474.patch

This patch makes several changes in vectorization.  [~mmccline] and [~ashutoshc], as I am
not very familiar with this code and I know it is very performance sensitive, I would
appreciate your feedback on the patch.

The issue causing problems was that VectorizedBatchUtil.addRowToBatchFrom is used by VectorizedOrcAcidRowReader
to take the merged rows from an ACID read and put them in a vector batch.  But this method
appears to have been built for use by vector operators, not by file formats, where columns
may be missing because they have been projected out, or may already have values set because
they are partition columns.  So I made the following changes:
# I changed addRowToBatchFrom to skip writing values into ColumnVectors that are null.  This
handles the case where a column has been projected out and its ColumnVector is therefore null.
# I changed VectorizedRowBatch to carry a boolean array tracking which columns are partition
columns, and VectorizedRowBatchCtx.createVectorizedRowBatch to populate this array.
# I changed addRowToBatchFrom to skip writing values into ColumnVectors that are marked in
VectorizedRowBatch as partition columns, since writing them would overwrite the values that
have already been put there by VectorizedRowBatchCtx.addPartitionColsToBatch.

My concern is whether it is appropriate to mix this skipping of projected-out and partition
columns into addRowToBatchFrom.  If you think it isn't good, I can write a new method to do
this, but that will involve a fair amount of duplicated code.

[~owen.omalley], I also changed VectorizedOrcAcidRowReader to set the partition column values
after every call to VectorizedRowBatch.reset in next.  Without doing this, the code was NPEing
later in the pipeline because the partition columns had been reset to null.  It appeared that
you had copied the code from VectorizedOrcInputFormat, which only called addPartitionColsToBatch
once, but which never called reset.  I tried removing the call to reset, but that caused other
issues.
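The ordering fix above boils down to re-applying the partition values on every call, since reset clears every column. A minimal sketch, again with hypothetical simplified types rather than the real Hive classes:

```java
import java.util.Arrays;

// Sketch of the ordering fix in VectorizedOrcAcidRowReader.next(): because
// reset() nulls out every column, partition values must be re-applied after
// each reset, not set just once at construction time.
public class ResetSketch {
    // Stand-in batch with a single partition column.
    static class Batch {
        Long[] partCol = new Long[8];
        int size;

        void reset() {
            Arrays.fill(partCol, null); // reset wipes the partition values too
            size = 0;
        }
    }

    // Stand-in for VectorizedRowBatchCtx.addPartitionColsToBatch.
    static void addPartitionColsToBatch(Batch b, long value) {
        Arrays.fill(b.partCol, Long.valueOf(value));
    }

    // next(): re-populate partition columns after every reset; otherwise
    // downstream operators see nulls and NPE.
    static void next(Batch b, long partValue) {
        b.reset();
        addPartitionColsToBatch(b, partValue); // the added per-call step
        b.size = 1;
    }

    public static void main(String[] args) {
        Batch b = new Batch();
        next(b, 2014L);
        System.out.println(b.partCol[0]);
        next(b, 2014L);
        System.out.println(b.partCol[0]); // still set, not null, after reset
    }
}
```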

> Vectorized reads of transactional tables fail when not all columns are selected
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-8474
>                 URL: https://issues.apache.org/jira/browse/HIVE-8474
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions, Vectorization
>    Affects Versions: 0.14.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Critical
>             Fix For: 0.14.0
>
>         Attachments: HIVE-8474.patch
>
>
> {code}
> create table concur_orc_tab(name varchar(50), age int, gpa decimal(3, 2)) clustered by (age) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true');
> select name, age from concur_orc_tab order by name;
> {code}
> results in
> {code}
> Diagnostic Messages for this Task:
> Error: java.io.IOException: java.lang.NullPointerException
>         at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>         at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>         at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:352)
>         at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
>         at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
>         at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:115)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.setNullColIsNullValue(VectorizedBatchUtil.java:63)
>         at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.addRowToBatchFrom(VectorizedBatchUtil.java:443)
>         at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.addRowToBatch(VectorizedBatchUtil.java:214)
>         at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:95)
>         at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:43)
>         at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:347)
>         ... 13 more
> {code}
> The issue is that the object inspector passed to VectorizedOrcAcidRowReader has all of the columns in the file rather than only the projected columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
