spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-20187) Replace loadTable with moveFile to speed up load table for many output files
Date Sat, 01 Apr 2017 10:36:41 GMT

    [ https://issues.apache.org/jira/browse/SPARK-20187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952169#comment-15952169
] 

Apache Spark commented on SPARK-20187:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/17505

> Replace loadTable with moveFile to speed up load table for many output files
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-20187
>                 URL: https://issues.apache.org/jira/browse/SPARK-20187
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Yuming Wang
>
> [HiveClientImpl.loadTable|https://github.com/apache/spark/blob/v2.1.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L667]
> loads files one by one, so this step takes a long time when a job generates many files.
> There is a [Hive.moveFile api|https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2567]
> that can speed up this step for {{create table tableName as select ...}} and {{insert overwrite
> table tableName select ...}}.
> {noformat:align=left|title=loadTable api: It took about 26 minutes (10:50:14 - 11:16:18) to load table}
> 17/04/01 10:50:04 INFO TaskSetManager: Finished task 207165.0 in stage 0.0 (TID 216796)
in 5952 ms on jqhadoop-test28-8.int.yihaodian.com (executor 54) (216869/216869)
> 17/04/01 10:50:04 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed,
from pool 
> 17/04/01 10:50:04 INFO DAGScheduler: ResultStage 0 (processCmd at CliDriver.java:376)
finished in 541.797 s
> 17/04/01 10:50:04 INFO DAGScheduler: Job 0 finished: processCmd at CliDriver.java:376,
took 551.208919 s
> 17/04/01 10:50:04 INFO FileFormatWriter: Job null committed.
> 17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00000-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00000-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
Status:true
> 17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00001-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00001-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
Status:true
> ...
> 17/04/01 11:16:11 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-99999-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-99999-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
Status:true
> 17/04/01 11:16:18 INFO SparkSqlParser: Parsing command: `tmp`.`spark_load_slow`
> 17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
> Time taken: 2178.736 seconds
> 17/04/01 11:16:18 INFO CliDriver: Time taken: 2178.736 seconds
> {noformat}
> {noformat:align=left|title=moveFile api: It took about 9 minutes (13:24:39 - 13:33:46) to load table}
> 17/04/01 13:24:38 INFO TaskSetManager: Finished task 210610.0 in stage 0.0 (TID 216829)
in 5888 ms on jqhadoop-test28-28.int.yihaodian.com (executor 59) (216869/216869)
> 17/04/01 13:24:38 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed,
from pool 
> 17/04/01 13:24:38 INFO DAGScheduler: ResultStage 0 (processCmd at CliDriver.java:376)
finished in 532.409 s
> 17/04/01 13:24:38 INFO DAGScheduler: Job 0 finished: processCmd at CliDriver.java:376,
took 539.337610 s
> 17/04/01 13:24:39 INFO FileFormatWriter: Job null committed.
> 17/04/01 13:24:39 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_13-14-46_099_8962745596360417817-1/-ext-10000,
dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow_movefile, Status:true
> 17/04/01 13:33:46 INFO SparkSqlParser: Parsing command: `tmp`.`spark_load_slow_movefile`
> 17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
> Time taken: 1142.671 seconds
> 17/04/01 13:33:46 INFO CliDriver: Time taken: 1142.671 seconds
> {noformat}
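The two logs above differ mainly in how files reach the destination: one Hive `Replacing` call per part file in the first run versus a single call for the whole `-ext-10000` staging directory in the second. A minimal, hypothetical sketch of that contrast using plain `java.nio` (not the actual Hive or Spark code; `LoadTableDemo` and both method names are invented for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

/** Illustration only: why renaming a staging directory once beats
 *  moving its part files one by one. */
public class LoadTableDemo {

    // One filesystem call per part file: this is the shape of the
    // slow path, where every file is replaced individually.
    static int moveFileByFile(Path src, Path dest) throws IOException {
        Files.createDirectories(dest);
        int moved = 0;
        try (Stream<Path> parts = Files.list(src)) {
            for (Path f : (Iterable<Path>) parts::iterator) {
                Files.move(f, dest.resolve(f.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
                moved++;
            }
        }
        return moved;
    }

    // A single rename of the staging directory: the idea behind
    // moving the whole output in one call.
    static void moveWholeDirectory(Path src, Path dest) throws IOException {
        Files.move(src, dest);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("load-table-demo");
        Path staging = Files.createDirectory(base.resolve("staging"));
        for (int i = 0; i < 5; i++) {
            Files.createFile(staging.resolve(String.format("part-%05d", i)));
        }
        Path dest = base.resolve("warehouse");
        moveWholeDirectory(staging, dest);   // one call instead of five
        try (Stream<Path> files = Files.list(dest)) {
            System.out.println(files.count()); // 5
        }
    }
}
```

On a real cluster the moves go through HDFS/viewfs rather than a local filesystem, but the shape of the cost is the same: a call per part file in the first case, one call regardless of file count in the second.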



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


