Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id A45F4200CF7 for ; Tue, 15 Aug 2017 00:48:09 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id A30B9166010; Mon, 14 Aug 2017 22:48:09 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id CBF1816600D for ; Tue, 15 Aug 2017 00:48:08 +0200 (CEST) Received: (qmail 41575 invoked by uid 500); 14 Aug 2017 22:48:06 -0000 Mailing-List: contact issues-help@spark.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list issues@spark.apache.org Received: (qmail 41415 invoked by uid 99); 14 Aug 2017 22:48:06 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 14 Aug 2017 22:48:06 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 7D0941806DF for ; Mon, 14 Aug 2017 22:48:05 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -99.202 X-Spam-Level: X-Spam-Status: No, score=-99.202 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id 8yV2Pi3yKaWl for ; Mon, 14 Aug 2017 22:48:04 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTP id 729705FBB2 for ; Mon, 14 Aug 2017 22:48:04 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id E73A7E0E8C for ; Mon, 14 Aug 2017 22:48:02 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id CD10D21907 for ; Mon, 14 Aug 2017 22:48:01 +0000 (UTC) Date: Mon, 14 Aug 2017 22:48:01 +0000 (UTC) From: "Xiao Li (JIRA)" To: issues@spark.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Updated] (SPARK-21721) Memory leak in org.apache.spark.sql.hive.execution.InsertIntoHiveTable MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Mon, 14 Aug 2017 22:48:09 -0000 [ https://issues.apache.org/jira/browse/SPARK-21721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21721: ---------------------------- Priority: Critical (was: Major) > Memory leak in org.apache.spark.sql.hive.execution.InsertIntoHiveTable > ---------------------------------------------------------------------- > > Key: SPARK-21721 > URL: https://issues.apache.org/jira/browse/SPARK-21721 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.2, 2.1.1, 2.2.0 > Reporter: yzheng616 > Priority: Critical > > The leak came from org.apache.spark.sql.hive.execution.InsertIntoHiveTable. At line 118, it put a staging path to FileSystem delete cache, and then remove the path from disk at line 385. It does not remove the path from FileSystem cache. If a streaming application keep persisting data to a partitioned hive table, the memory will keep increasing until JVM terminated. > Below is a simple code to reproduce it. > {code:java} > package test > import org.apache.spark.sql.SparkSession > import org.apache.hadoop.fs.Path > import org.apache.hadoop.fs.FileSystem > import org.apache.spark.sql.SaveMode > import java.lang.reflect.Field > case class PathLeakTest(id: Int, gp: String) > object StagePathLeak { > def main(args: Array[String]): Unit = { > val spark = SparkSession.builder().master("local[4]").appName("StagePathLeak").enableHiveSupport().getOrCreate() > spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict") > //create a partitioned table > spark.sql("drop table if exists path_leak"); > spark.sql("create table if not exists path_leak(id int)" + > " partitioned by (gp String)"+ > " row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'"+ > " stored as"+ > " inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'"+ > " outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'") > var seq = new scala.collection.mutable.ArrayBuffer[PathLeakTest]() > // 2 partitions > for (x <- 1 to 2) { > seq += (new PathLeakTest(x, "g" + x)) > } > val rdd = spark.sparkContext.makeRDD[PathLeakTest](seq) > //insert 50 records to Hive table > for (j <- 1 to 50) { > val df = spark.createDataFrame(rdd) > //#1 InsertIntoHiveTable line 118: add stage path to FileSystem deleteOnExit cache > //#2 InsertIntoHiveTable line 385: delete the path from disk but not from the FileSystem cache, and it caused the leak > df.write.mode(SaveMode.Overwrite).insertInto("path_leak") > } > > val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration) > val deleteOnExit = getDeleteOnExit(fs.getClass) > deleteOnExit.setAccessible(true) > val caches = deleteOnExit.get(fs).asInstanceOf[java.util.TreeSet[Path]] > //check FileSystem deleteOnExit cache size > println(caches.size()) > val it = caches.iterator() > //all starge pathes were still cached even they have already been deleted from the disk > while(it.hasNext()){ > println(it.next()); > } > } > > def getDeleteOnExit(cls: Class[_]) : Field = { > try{ > return cls.getDeclaredField("deleteOnExit") > }catch{ > case ex: NoSuchFieldException => return getDeleteOnExit(cls.getSuperclass) > } > return null > } > } > {code} > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org For additional commands, e-mail: issues-help@spark.apache.org