hive-dev mailing list archives

From "Na Yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-7870) Insert overwrite table query does not generate correct task plan [Spark Branch]
Date Sat, 06 Sep 2014 06:46:28 GMT

    [ https://issues.apache.org/jira/browse/HIVE-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124358#comment-14124358
] 

Na Yang commented on HIVE-7870:
-------------------------------

Removing those duplicate filesinks is hard because, at the time they are added to the
filesinkset, we cannot tell which filesink will eventually be used by the spark work.
If we remove the wrong filesink from the filesinkset, we cannot create the proper
linkedfilesinks for the target filesink. This causes wrong results for the merge
and move work when hive.merge.sparkfiles is turned ON.

For example, in the following query, three duplicate filesinks, FS1, FS2 and FS3, are added
to the filesinkset (numbered in the order they are added to the filesinkset).
FS2 and FS3 are used for the subqueries of the outer union. In addition, FS2 and FS3 have
different directories when hive.merge.sparkfiles=true.

insert overwrite table outputTbl1
SELECT * FROM
(
select key, 1 as values from inputTbl1
union all
select * FROM (
  SELECT key, count(1) as values from inputTbl1 group by key
  UNION ALL
  SELECT key, 2 as values from inputTbl1
) a
)b;

However, in the following query, three duplicate filesinks, FS1, FS2 and FS3, are again
added to the filesinkset, but this time FS1 is used for the subqueries of the union.
FS1, FS2 and FS3 all have the same directory when hive.merge.sparkfiles=true.

insert overwrite table outputTbl1
SELECT * FROM
(
select key, 1 as values from inputTbl1
union all
select * FROM (
  SELECT key, 3 as values from inputTbl1
  UNION ALL
  SELECT key, 2 as values from inputTbl1
) a
)b;

When the filesinks are added to the filesinkset, the final plan has not been generated yet,
so there is no way to know which filesink should be left out of the set. After the final
plan is generated, it is also hard to detect the duplicate filesinks and remove the right one.


Therefore, the duplicate filesinks stay in the filesinkset. The potential problem they
cause is that multiple merge and move works are generated when hive.merge.sparkfiles=true.
The patch resolves this by linking the duplicate filesinks together and using a HashMap
to make sure each directory is processed only once, so only one merge and move work is
generated per directory no matter how many duplicate filesinks exist.
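
The per-directory de-duplication described above can be sketched roughly as follows. This is only an illustration and not the code in the patch: the FileSink class, the linkByDirectory and planMergeAndMove methods, and the directory strings are all hypothetical stand-ins, not Hive classes. A LinkedHashMap is used in place of a plain HashMap only so the iteration order is stable.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; none of these types come from the Hive code base.
final class FileSinkDedupSketch {

    /** Hypothetical stand-in for a file sink: a label plus its target directory. */
    static final class FileSink {
        final String name;
        final String dir;

        FileSink(String name, String dir) {
            this.name = name;
            this.dir = dir;
        }
    }

    /**
     * Links file sinks that write to the same directory by grouping them
     * under one map key. Duplicates are kept, not removed.
     */
    static Map<String, List<FileSink>> linkByDirectory(List<FileSink> sinks) {
        Map<String, List<FileSink>> byDir = new LinkedHashMap<>();
        for (FileSink fs : sinks) {
            byDir.computeIfAbsent(fs.dir, d -> new ArrayList<>()).add(fs);
        }
        return byDir;
    }

    /**
     * Produces one (hypothetical) merge-and-move task description per distinct
     * directory, regardless of how many duplicate sinks point at it.
     */
    static List<String> planMergeAndMove(List<FileSink> sinks) {
        List<String> tasks = new ArrayList<>();
        for (Map.Entry<String, List<FileSink>> e : linkByDirectory(sinks).entrySet()) {
            tasks.add("merge+move " + e.getKey()
                    + " (" + e.getValue().size() + " linked filesinks)");
        }
        return tasks;
    }
}
```

With three sinks that share one directory, planMergeAndMove returns a single task. If two of the sinks use a different directory from the third, it returns one task per distinct directory.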




> Insert overwrite table query does not generate correct task plan [Spark Branch]
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-7870
>                 URL: https://issues.apache.org/jira/browse/HIVE-7870
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Na Yang
>            Assignee: Na Yang
>              Labels: Spark-M1
>         Attachments: HIVE-7870.1-spark.patch, HIVE-7870.2-spark.patch, HIVE-7870.3-spark.patch,
>                      HIVE-7870.4-spark.patch, HIVE-7870.5-spark.patch
>
>
> Insert overwrite table query does not generate correct task plan when the
> hive.optimize.union.remove and hive.merge.sparkfiles properties are ON.
> {noformat}
> set hive.optimize.union.remove=true;
> set hive.merge.sparkfiles=true;
> insert overwrite table outputTbl1
> SELECT * FROM
> (
> select key, 1 as values from inputTbl1
> union all
> select * FROM (
>   SELECT key, count(1) as values from inputTbl1 group by key
>   UNION ALL
>   SELECT key, 2 as values from inputTbl1
> ) a
> )b;
> select * from outputTbl1 order by key, values;
> {noformat}
> query result:
> {noformat}
> 1	1
> 1	2
> 2	1
> 2	2
> 3	1
> 3	2
> 7	1
> 7	2
> 8	2
> 8	2
> 8	2
> {noformat}
> expected result:
> {noformat}
> 1	1
> 1	1
> 1	2
> 2	1
> 2	1
> 2	2
> 3	1
> 3	1
> 3	2
> 7	1
> 7	1
> 7	2
> 8	1
> 8	1
> 8	2
> 8	2
> 8	2
> {noformat}
> The move work does not run properly and some data goes missing during the move.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
