hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Szehon Ho (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-13300) Hive on spark throws exception for multi-insert
Date Fri, 18 Mar 2016 02:25:33 GMT

     [ https://issues.apache.org/jira/browse/HIVE-13300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Szehon Ho updated HIVE-13300:
-----------------------------
    Attachment: HIVE-13300.patch

It happens when there is a multi-insert with a join.

Multi-insert on Spark clones part of the operator-tree and runs it for each of the insert
targets.  Spark will pass the same input keys to each operator tree (not always, but will
happen if they run on the same executor).  Just so happens that current join operators modify
the input by stripping the tag.  So one operator tree will corrupt subsequent ones.

One fix is to make a copy of the input in this scenario before stripping the tag.  Not sure
if there is a better long term fix.

> Hive on spark throws exception for multi-insert
> -----------------------------------------------
>
>                 Key: HIVE-13300
>                 URL: https://issues.apache.org/jira/browse/HIVE-13300
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 2.0.0
>            Reporter: Szehon Ho
>            Assignee: Szehon Ho
>         Attachments: HIVE-13300.patch
>
>
> For certain multi-insert queries, Hive on Spark throws a deserialization error.
> {noformat}
> create table status_updates(userid int,status string,ds string);
> create table profiles(userid int,school string,gender int);
> drop table school_summary; create table school_summary(school string,cnt int) partitioned
by (ds string);
> drop table gender_summary; create table gender_summary(gender int,cnt int) partitioned
by (ds string);
> insert into status_updates values (1, "status_1", "2016-03-16");
> insert into profiles values (1, "school_1", 0);
> set hive.auto.convert.join=false;
> set hive.execution.engine=spark;
> FROM (SELECT a.status, b.school, b.gender
> FROM status_updates a JOIN profiles b
> ON (a.userid = b.userid and
> a.ds='2009-03-20' )
> ) subq1
> INSERT OVERWRITE TABLE gender_summary
> PARTITION(ds='2009-03-20')
> SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
> INSERT OVERWRITE TABLE school_summary
> PARTITION(ds='2009-03-20')
> SELECT subq1.school, COUNT(1) GROUP BY subq1.school
> {noformat}
> Error:
> {noformat}
> 16/03/17 13:29:00 [task-result-getter-3]: WARN scheduler.TaskSetManager: Lost task 0.0
in stage 2.0 (TID 3, localhost): java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
Hive Runtime Error: Unable to deserialize reduce input key from x1x128x0x0 with properties
{serialization.sort.order.null=a, columns=reducesinkkey0, serialization.lib=org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe,
serialization.sort.order=+, columns.types=int}
> 	at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:279)
> 	at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:49)
> 	at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
> 	at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
> 	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
> 	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:89)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:724)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error: Unable
to deserialize reduce input key from x1x128x0x0 with properties {serialization.sort.order.null=a,
columns=reducesinkkey0, serialization.lib=org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe,
serialization.sort.order=+, columns.types=int}
> 	at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:251)
> 	... 12 more
> Caused by: org.apache.hadoop.hive.serde2.SerDeException: java.io.EOFException
> 	at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:241)
> 	at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:249)
> 	... 12 more
> Caused by: java.io.EOFException
> 	at org.apache.hadoop.hive.serde2.binarysortable.InputByteBuffer.read(InputByteBuffer.java:54)
> 	at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserializeInt(BinarySortableSerDe.java:597)
> 	at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:288)
> 	at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:237)
> 	... 13 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message