hive-issues mailing list archives

From "Prasanth Jayachandran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-17280) Data loss in CONCATENATE ORC created by Spark
Date Tue, 05 Sep 2017 21:51:06 GMT

    [ https://issues.apache.org/jira/browse/HIVE-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154377#comment-16154377 ]

Prasanth Jayachandran commented on HIVE-17280:
----------------------------------------------

[~mgaido] I posted a patch to HIVE-17280 that will fix the issue (along with adding restrictions).
I tested it locally and it worked. If concatenation finds an incompatible file, it will rename
the file to follow Hive's naming convention, which avoids the issue that I mentioned above.

> Data loss in CONCATENATE ORC created by Spark
> ---------------------------------------------
>
>                 Key: HIVE-17280
>                 URL: https://issues.apache.org/jira/browse/HIVE-17280
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Spark
>    Affects Versions: 1.2.1
>         Environment: Spark 1.6.3
>            Reporter: Marco Gaido
>            Priority: Critical
>
> Hive concatenation causes data loss if the ORC files in the table were written by Spark.
> Here are the steps to reproduce the problem:
>  - create a table;
> {code:java}
> hive
> hive> create table aa (a string, b int) stored as orc;
> {code}
>  - insert 2 rows using Spark;
> {code:java}
> spark-shell
> scala> case class AA(a:String, b:Int)
> scala> val df = sc.parallelize(Array(AA("b",2),AA("c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - change table schema;
> {code:java}
> hive
> hive> alter table aa add columns(aa string, bb int);
> {code}
>  - insert another 2 rows using Spark;
> {code:java}
> spark-shell
> scala> case class BB(a:String, b:Int, aa:String, bb:Int)
> scala> val df = sc.parallelize(Array(BB("b",2,"b",2),BB("c",3,"c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - at this point, running a select statement with Hive correctly returns *4 rows* in the table; then run the concatenation
> {code:java}
> hive
> hive> alter table aa concatenate;
> {code}
> At this point, a select returns only *3 rows, i.e. a row is missing*.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
