hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-18763) VectorMapOperator should take into account partition->table serde conversion for all cases
Date Thu, 22 Feb 2018 19:13:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-18763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373272#comment-16373272 ]

Sergey Shelukhin commented on HIVE-18763:
-----------------------------------------

Hey, I was planning to work on it, but not for a while (after some other bugs), so feel free to take it if you want.
My thinking after looking at the row converters was that the actual changes for most types are simple and insignificant (varchar length is one example), so the change would be mostly wiring.
It should be easy to add a separate VRB converter rather than trying to shoehorn this into VectorDeserializeRow or OI converters, since both of those deal with rows (row->row or row->VRB). All the necessary schema information is already available in VectorMapOperator - it is already used for the non-native-VRB case.
Then making sure it doesn't apply to cases that do their own conversion (the only one right
now is ORC) is another small wiring issue. 
Under my current plan I could probably get to it next week...
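To illustrate the idea, here is a minimal sketch of what a batch-level converter for the varchar-length case would do. This is plain Java with String[] columns standing in for Hive's real vector classes; the class and method names are hypothetical, not Hive's actual VectorizedRowBatch API:

```java
import java.util.Arrays;

// Hypothetical sketch: the partition was written as varchar(50), but the
// table schema now says varchar(3), so values must be truncated when the
// reader itself performs no conversion. A VRB-level converter would apply
// this kind of per-column fixup after the batch is populated.
public class VarcharBatchConverterSketch {

    // Truncate every value in a string column to the table's varchar length.
    static void enforceVarcharLength(String[] column, int maxLength) {
        for (int i = 0; i < column.length; i++) {
            if (column[i] != null && column[i].length() > maxLength) {
                column[i] = column[i].substring(0, maxLength);
            }
        }
    }

    public static void main(String[] args) {
        // Column data read with the old varchar(50) partition schema.
        String[] vs = { "positive", "ok", null, "abcd" };
        enforceVarcharLength(vs, 3); // table schema is now varchar(3)
        System.out.println(Arrays.toString(vs)); // [pos, ok, null, abc]
    }
}
```

The point of doing this at the batch level is that it works the same for every reader that fills VRBs natively, instead of each input format inventing its own conversion.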

> VectorMapOperator should take into account partition->table serde conversion for all
cases
> ------------------------------------------------------------------------------------------
>
>                 Key: HIVE-18763
>                 URL: https://issues.apache.org/jira/browse/HIVE-18763
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> When table and partition schemas differ, the non-vectorized MapOperator does row-by-row conversion from whatever is read to the table schema.
> VectorMapOperator is less consistent... it does the conversion as part of populating VRBs in row/serde modes (used to read e.g. text row-by-row or natively, and make VRBs); see the VectorDeserializeRow class convert... methods for an example. However, the native VRB mode relies on ORC ConvertTreeReader... logic that lives in ORC, and so never converts anything inside VMO.
> So, anything running in native VRB mode that is not the vanilla ORC reader will produce data with an incorrect schema if there were schema changes and partitions are present - there are two such cases right now: LLAP IO with ORC or text data, and Parquet.
> It's possible to extend the ConvertTreeReader... logic to LLAP IO ORC, which already uses TreeReader-s for everything; however, LLAP IO text, (non-LLAP) Parquet, and any future users would have to invent their own conversion.
> Therefore, I think the best fix for this is to treat all inputs in VMO the same and convert them by default, like the regular MapOperator does; and make the ORC special mode an exception that allows it to bypass the conversion.
> cc [~mmccline]
> Test case - varchar column length should be limited after alter table but it isn't.
> {noformat}
> CREATE TABLE schema_evolution_data(insert_num int, boolean1 boolean, tinyint1 tinyint,
> smallint1 smallint, int1 int, bigint1 bigint, decimal1 decimal(38,18), float1 float,
> double1 double, string1 varchar(50), string2 varchar(50), date1 date, timestamp1 timestamp,
> boolean_str string, tinyint_str string, smallint_str string, int_str string,
> bigint_str string, decimal_str string, float_str string, double_str string,
> date_str string, timestamp_str string, filler string)
> row format delimited fields terminated by '|' stored as textfile;
> load data local inpath '../../data/files/schema_evolution/schema_evolution_data.txt'
> overwrite into table schema_evolution_data;
> drop table if exists vsp;
> create table vsp(vs varchar(50)) partitioned by(s varchar(50)) stored as textfile;
> insert into table vsp partition(s='positive') select string1 from schema_evolution_data;
> alter table vsp change column vs vs varchar(3);
> drop table if exists vsp_orc;
> create table vsp_orc(vs varchar(50)) partitioned by(s varchar(50)) stored as orc;
> insert into table vsp_orc partition(s='positive') select string1 from schema_evolution_data;
> alter table vsp_orc change column vs vs varchar(3);
> drop table if exists vsp_parquet;
> create table vsp_parquet(vs varchar(50)) partitioned by(s varchar(50)) stored as parquet;
> insert into table vsp_parquet partition(s='positive') select string1 from schema_evolution_data;
> alter table vsp_parquet change column vs vs varchar(3);
> SET hive.llap.io.enabled=true;
> -- BAD results from all queries; parquet affected regardless of IO.
> select length(vs) from vsp; 
> select length(vs) from vsp_orc;
> select length(vs) from vsp_parquet;
> SET hive.llap.io.enabled=false;
> select length(vs) from vsp; -- ok
> select length(vs) from vsp_orc; -- ok
> select length(vs) from vsp_parquet; -- still bad
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
