spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
Date Wed, 09 Nov 2016 00:52:58 GMT

     [ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18372:
------------------------------------

    Assignee:     (was: Apache Spark)

> .Hive-staging folders created from Spark hiveContext are not getting cleaned up
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18372
>                 URL: https://issues.apache.org/jira/browse/SPARK-18372
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.2, 1.6.3
>         Environment: Spark standalone and Spark on YARN
>            Reporter: mingjie tang
>             Fix For: 2.0.1
>
>
> Steps to reproduce:
> ================
> 1. Launch spark-shell.
> 2. Run the following Scala code via spark-shell:
> scala> val hivesampletabledf = sqlContext.table("hivesampletable")
> scala> import org.apache.spark.sql.DataFrameWriter
> scala> val dfw: DataFrameWriter = hivesampletabledf.write
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )")
> scala> dfw.insertInto("hivesampletablecopypy")
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> scala> hivesampletablecopypydfdf.show
> 3. In HDFS (in our case, WASB), we can see the following folders:
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-10000
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> The issue is that these folders are not cleaned up and accumulate over time.
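> For reference, a minimal Scala check, run from the same spark-shell, that lists any leftover staging directories (the table path below is an assumption based on the listing above):
> import org.apache.hadoop.fs.{FileSystem, Path}
> // List everything under the table directory that looks like a staging dir.
> val fs = FileSystem.get(sc.hadoopConfiguration)
> val tableDir = new Path("hive/warehouse/hivesampletablecopypy")
> fs.listStatus(tableDir)
>   .filter(_.getPath.getName.startsWith(".hive-staging"))
>   .foreach(s => println(s.getPath))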
> =====
> With the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" in hive-site.xml; it didn't make any difference.
> The .hive-staging folders are created under the <TableName> folder - hive/warehouse/hivesampletablecopypy/
> We have also tried adding this property to hive-site.xml and restarting the components:
> <property>
>   <name>hive.exec.stagingdir</name>
>   <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
> </property>
> A new .hive-staging folder was still created in the hive/warehouse/<tablename> folder.
> Moreover, please note that if we run the same query in pure Hive via the Hive CLI on the same Spark cluster, we don't see this behavior,
> so it doesn't appear to be a Hive issue/behavior in this case - this is Spark behavior.
> I checked in Ambari; spark.yarn.preserve.staging.files=false is already set in the Spark configuration.
> The issue happens via spark-submit as well - the customer used the following command to reproduce it:
> spark-submit test-hive-staging-cleanup.py
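> The contents of that script are not attached; a rough Scala equivalent of the repro (object and app names are hypothetical) would be:
> // Hypothetical spark-submit job mirroring the customer's repro script.
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.hive.HiveContext
> object TestHiveStagingCleanup {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setAppName("test-hive-staging-cleanup"))
>     val sqlContext = new HiveContext(sc)
>     // The insert goes through InsertIntoHiveTable and creates a .hive-staging dir.
>     sqlContext.table("hivesampletable").write.insertInto("hivesampletablecopypy")
>     sc.stop()  // after exit, the .hive-staging dir remains under the table folder
>   }
> }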
> Solution:
> This bug was reported by customers.
> The reason is that org.apache.spark.sql.hive's InsertIntoHiveTable calls the Hive classes (org.apache.hadoop.hive.) to create the staging directory. By default, on the Hive side, this staging directory would be removed after the Hive session expires. However, Spark fails to notify Hive to remove the staging files.
> Thus, following the code of Spark 2.0.x, I wrote a function inside InsertIntoHiveTable to create the .hive-staging directory itself, so that after the Spark session expires, the staging directory is removed.
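> A rough sketch of that idea (not the actual patch; names are illustrative): create the staging directory ourselves and make sure it is deleted when the write finishes or the JVM exits.
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> // Create a staging dir under the table location and register it for cleanup.
> def createStagingDir(tableLocation: Path, conf: Configuration): Path = {
>   val fs = FileSystem.get(conf)
>   val staging = new Path(tableLocation, ".hive-staging_" + System.nanoTime())
>   fs.mkdirs(staging)
>   fs.deleteOnExit(staging)  // removed even if the session ends uncleanly
>   staging
> }
> // Delete the staging dir explicitly once the insert has committed.
> def cleanupStagingDir(staging: Path, conf: Configuration): Unit = {
>   val fs = FileSystem.get(conf)
>   if (fs.exists(staging)) fs.delete(staging, true)  // recursive delete
> }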
> This update has been tested for Spark 1.5.2 and Spark 1.6.3, and the pull request is:
> For the test, I have manually checked for .hive-staging files under the table's directory after the spark-shell closes. Meanwhile, please advise how to write the test case, because the directory for the related tables cannot easily be obtained.
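> One possible shape for such a test, assuming the warehouse directory is known to the test (e.g. from hive.metastore.warehouse.dir); the helper name is illustrative:
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> // Fail if any .hive-staging entries remain under the table's directory.
> def assertNoStagingDirs(warehouseDir: String, table: String, conf: Configuration): Unit = {
>   val fs = FileSystem.get(conf)
>   val leftover = fs.listStatus(new Path(warehouseDir, table))
>     .map(_.getPath.getName)
>     .filter(_.startsWith(".hive-staging"))
>   assert(leftover.isEmpty, s"staging dirs not cleaned up: ${leftover.mkString(", ")}")
> }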



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
