hive-issues mailing list archives

From "Prasanth Jayachandran (JIRA)" <>
Subject [jira] [Commented] (HIVE-17280) Data loss in CONCATENATE ORC created by Spark
Date Wed, 30 Aug 2017 08:53:01 GMT


Prasanth Jayachandran commented on HIVE-17280:

That is certainly not the format that Hive expects. After concatenation, the merged and unmerged
(incompatible) files are moved to a staging directory. MoveTask then moves the files from the
staging directory to the final destination directory (which, in the case of concatenation, is
also the source directory). MoveTask makes certain assumptions about filenames to handle bucketing,
speculative execution, etc. In the example files you provided, part-00000_copy_1 and
part-00001_copy_1 will be considered the same file written by different tasks (from speculative
execution), and the largest file will be picked as the winner of the speculated execution. This
is the same issue as HIVE-17403. Hive usually writes files in the format 000000_0, where 000000
is the task id/bucket id and the digit after the _ is treated as the task attempt. I am working
on a patch that will restrict concatenation for external tables. And for Hive managed tables,
the load data command will make sure the filenames conform to Hive's expectation.
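To see why the two Spark-written filenames collide, here is a rough sketch of the kind of task-id extraction Hive performs on filenames (cf. Utilities.getTaskIdFromFilename in Hive; the regex below is illustrative, not necessarily Hive's exact pattern):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaskIdSketch {
    // Illustrative pattern: lazily skip a prefix, capture the last run of
    // digits as the task id, and treat a trailing _<digits> as the attempt.
    private static final Pattern TASK_ID =
        Pattern.compile("^.*?([0-9]+)(_[0-9]{1,6})?(\\..*)?$");

    static String taskId(String fileName) {
        Matcher m = TASK_ID.matcher(fileName);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Both Spark-written names resolve to task id "1", so they look like
        // two speculative attempts of the same task, and only one survives.
        System.out.println(taskId("part-00000_copy_1")); // 1
        System.out.println(taskId("part-00001_copy_1")); // 1
        // A Hive-written name keeps its bucket/task id intact.
        System.out.println(taskId("000000_0"));          // 000000
    }
}
```

Under such a rule, "part-00000_copy_1" and "part-00001_copy_1" both reduce to task id "1", which is exactly the de-duplication that drops one of the files.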

> Data loss in CONCATENATE ORC created by Spark
> ---------------------------------------------
>                 Key: HIVE-17280
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Spark
>    Affects Versions: 1.2.1
>         Environment: Spark 1.6.3
>            Reporter: Marco Gaido
>            Priority: Critical
> Hive concatenation causes data loss if the ORC files in the table were written by Spark.
> Here are the steps to reproduce the problem:
>  - create a table;
> {code:java}
> hive
> hive> create table aa (a string, b int) stored as orc;
> {code}
>  - insert 2 rows using Spark;
> {code:java}
> spark-shell
> scala> case class AA(a:String, b:Int)
> scala> val df = sc.parallelize(Array(AA("b",2),AA("c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - change table schema;
> {code:java}
> hive
> hive> alter table aa add columns(aa string, bb int);
> {code}
>  - insert other 2 rows with Spark
> {code:java}
> spark-shell
> scala> case class BB(a:String, b:Int, aa:String, bb:Int)
> scala> val df = sc.parallelize(Array(BB("b",2,"b",2),BB("c",3,"c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - at this point, running a select statement with Hive correctly returns *4 rows* in the table; then run the concatenation
> {code:java}
> hive
> hive> alter table aa concatenate;
> {code}
> At this point, a select returns only *3 rows, i.e. a row is missing*.

This message was sent by Atlassian JIRA
