spark-dev mailing list archives

From "Ulanov, Alexander" <alexander.ula...@hp.com>
Subject Re: Dataframe aggregation with Tungsten unsafe
Date Wed, 26 Aug 2015 00:19:15 GMT
Thank you for the explanation. The size of the 100M data is ~1.4GB in memory and each worker
has 32GB of memory, so there seems to be plenty of free memory available. I wonder how Spark
can hit GC with such a setup?

Reynold Xin <rxin@databricks.com> wrote:


On Fri, Aug 21, 2015 at 11:07 AM, Ulanov, Alexander <alexander.ulanov@hp.com> wrote:

It seems that there is a nice improvement with Tungsten enabled when the data is persisted
in memory: 2x and 3x. However, the improvement is smaller for parquet: 1.5x. Interestingly,
with Tungsten enabled, aggregation performance on in-memory data and on parquet data is
similar. Could anyone comment on this? It seems counterintuitive to me.

Local performance was not as good as what Reynold reported: I see around 1.5x, while he had
5x. However, local mode is not interesting.


I think a large part of that is coming from the pressure created by JVM GC. Putting more data
in-memory makes GC worse, unless GC is well tuned.
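
A minimal sketch (not from the thread) of how one might observe GC pressure on a plain
JVM: it allocates many small short-lived arrays while keeping a working set alive, then reads
the collector counters through the standard java.lang.management API. The class name
GcPressureSketch, the method names, and the row and working-set sizes are illustrative
assumptions, and the sketch does not reproduce Spark's actual memory layout.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class GcPressureSketch {

    // Sum collection counts across all collectors the JVM reports.
    static long totalGcCount() {
        long count = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount(); // -1 when the collector does not report it
            if (c > 0) {
                count += c;
            }
        }
        return count;
    }

    // Sum approximate accumulated collection time (milliseconds) across all collectors.
    static long totalGcTimeMillis() {
        long millis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                millis += t;
            }
        }
        return millis;
    }

    // Allocate `rows` two-element arrays, keeping at most `retained` of them reachable.
    // Returns the sum of the second field so the work cannot be optimized away.
    static long churn(int rows, int retained) {
        int keep = Math.max(1, retained); // hypothetical working-set size, at least 1
        List<long[]> live = new ArrayList<>(keep);
        long checksum = 0;
        for (int i = 0; i < rows; i++) {
            long[] row = new long[] { i, i * 2L };
            checksum += row[1];
            if (live.size() < keep) {
                live.add(row);
            } else {
                live.set(i % keep, row); // overwrite an older row, which becomes garbage
            }
        }
        return checksum;
    }

    public static void main(String[] args) {
        long countBefore = totalGcCount();
        long timeBefore = totalGcTimeMillis();
        long checksum = churn(20_000_000, 100_000);
        System.out.println("checksum = " + checksum);
        System.out.println("GC runs during churn = " + (totalGcCount() - countBefore));
        System.out.println("GC time during churn (ms) = " + (totalGcTimeMillis() - timeBefore));
    }
}
```

Running it with different -Xmx values or collectors (for example -XX:+UseG1GC) and different
working-set sizes gives a rough feel for how the amount of live data affects collection cost.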





