spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-16361) It takes a long time for GC when building a cube with many fields
Date Tue, 05 Jul 2016 07:56:10 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362167#comment-15362167 ]

Sean Owen commented on SPARK-16361:
-----------------------------------

Hm, I'm still not sure this is actionable, as it's not clear this is taking "too much memory":
you are running a pretty big query (a cube over nine fields) on a medium-sized data set.
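For context on why the query is big: a cube over n grouping columns computes 2^n grouping sets, one per subset of the columns, so nine fields mean 2^9 = 512 grouping sets, and each input row can fan out into up to 512 pre-aggregation records. A minimal sketch of the same shape of query using Spark's built-in DataFrame.cube (the reporter's Cuber appears to be a custom wrapper; the {{df}} variable here is hypothetical):

{code:java}
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.DataFrame;

// Hypothetical: "df" stands in for the transformed table in the report
// below. cube() over nine columns enumerates all 2^9 = 512 subsets of
// them, so each of the ~1M input rows can produce up to 512
// pre-aggregation records -- the source of the memory and GC pressure.
DataFrame cubed = df.cube("day", "AREA_CODE", "CUST_TYPE", "age",
        "zwsc", "month", "jidu", "year", "SUBTYPE")
    .agg(max("age"), min("age"), sum("zwsc"), count(lit(1)));
{code}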

> It takes a long time for GC when building a cube with many fields
> ----------------------------------------------------------------
>
>                 Key: SPARK-16361
>                 URL: https://issues.apache.org/jira/browse/SPARK-16361
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.2
>            Reporter: lichenglin
>
> I'm using Spark to build a cube on a DataFrame with 1M rows.
> I found that when I add too many fields (about 8 or more),
> the workers spend a lot of time in GC.
> I tried increasing the memory of each worker, but it did not help,
> and I don't know why. Sorry.
> Here is my simple code and the monitoring output.
> Cuber is a utility class for building cubes.
> {code:title=Bar.java|borderStyle=solid}
> // Register a UDF mapping a month (1-12) to its quarter (1-4).
> sqlContext.udf().register("jidu", (Integer f) -> {
>     return (f - 1) / 3 + 1;
> }, DataTypes.IntegerType);
>
> // Derive the dimension columns: age as a double, month/year/quarter
> // of "day", and years since INTIME as "zwsc".
> DataFrame d = sqlContext.table("dw.dw_cust_info").selectExpr("*",
>         "cast(CUST_AGE as double) as c_age",
>         "month(day) as month", "year(day) as year",
>         "cast((datediff(now(), INTIME) / 365 + 1) as int) as zwsc",
>         "jidu(month(day)) as jidu");
>
> // Bucket the continuous age into decade-wide bins.
> Bucketizer b = new Bucketizer().setInputCol("c_age")
>         .setSplits(new double[] { Double.NEGATIVE_INFINITY, 0, 10, 20, 30,
>                 40, 50, 60, 70, 80, 90, 100, Double.POSITIVE_INFINITY })
>         .setOutputCol("age");
>
> // Build the cube over nine dimensions with max/min/sum/count measures.
> DataFrame cube = new Cuber(b.transform(d))
>         .addFields("day", "AREA_CODE", "CUST_TYPE", "age", "zwsc",
>                 "month", "jidu", "year", "SUBTYPE")
>         .max("age").min("age").sum("zwsc").count().buildcube();
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
> Summary Metrics for 12 Completed Tasks
> ||Metric||Min||25th percentile||Median||75th percentile||Max||
> |Duration|2.6 min|2.7 min|2.7 min|2.7 min|2.7 min|
> |GC Time|1.6 min|1.6 min|1.6 min|1.6 min|1.6 min|
> |Shuffle Read Size / Records|728.4 KB / 21886|736.6 KB / 22258|738.7 KB / 22387|746.6 KB / 22542|748.6 KB / 22783|
> |Shuffle Write Size / Records|74.3 MB / 1926282|75.8 MB / 1965860|76.2 MB / 1976004|76.4 MB / 1981516|77.9 MB / 2021142|
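
Reading those metrics, and assuming these twelve tasks belong to the cube-expansion stage: each task reads roughly 22K shuffle records but writes roughly 2M, an expansion of about 90x per record. That is well below the 2^9 = 512x worst case noted above (partial aggregation collapses duplicate grouping keys before the shuffle), but it is still enough allocation churn that GC accounts for 1.6 min of each 2.7 min task, i.e. roughly 60% of task time.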



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
