spark-dev mailing list archives

From Reynold Xin <>
Subject Re: Dataframe aggregation with Tungsten unsafe
Date Wed, 26 Aug 2015 01:04:57 GMT
There is a lot of GC activity because the non-code-gen path is sloppy
about garbage creation. This is not literally what happens, but take this as
an example: { i: Int => i + 1 }

Under the hood this becomes a closure that boxes every input and every
output, creating two extra objects per call.
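As a minimal sketch of this boxing (not Spark's actual code), consider a generic interface whose type parameter erases to Object: invoking a primitive closure through it wraps the Int argument in a java.lang.Integer on the way in and wraps the Int result in another Integer on the way out. The `Transform` trait below is illustrative, not a Spark internal.

```scala
// Sketch: two allocations per call when an Int closure is invoked through
// an erased, generic interface.
object BoxingSketch {
  // After erasure, apply takes and returns Object.
  trait Transform[A] { def apply(a: A): A }

  def main(args: Array[String]): Unit = {
    val inc = new Transform[Int] { def apply(i: Int): Int = i + 1 }

    // Calling through the erased view forces boxing of input and output.
    val generic = inc.asInstanceOf[Transform[Any]]
    val in: Any  = 1000              // allocation #1: 1000 is outside the Integer cache
    val out: Any = generic.apply(in) // allocation #2: the result 1001 is boxed too

    assert(out == 1001)
    assert(out.asInstanceOf[AnyRef].getClass == classOf[java.lang.Integer])
  }
}
```

Per logical increment, that is two extra heap objects; at millions of rows per task, the allocations add up.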

The reality is more complicated than this -- but here's a simplified view of
what happens with GC in these cases. You might've heard from other places
that the JVM is very efficient at transient object allocations. That is
true when you look at these allocations in isolation, but unfortunately not
true when you look at them in aggregate.

First, due to the way the iterator interface is constructed, it is hard for
the JIT compiler to stack-allocate these objects via escape analysis. Then
two things happen:

1. They pile up and cause more young-gen GCs.
2. After a few young-gen GCs, some mid-tenured objects (e.g. an aggregation
map) get copied into the old gen, and eventually require a full GC to free
them. Full GCs are much more expensive than young-gen GCs (they usually
involve copying all the data in the old gen).
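The two effects above can be sketched in a toy aggregation loop -- the `Row` class here is illustrative, not Spark's row format. Each `next()` hands back a freshly allocated object that escapes into the caller, so the JIT cannot stack-allocate it and every row becomes young-gen garbage, while the aggregation map stays alive across collections and is a candidate for promotion to the old gen:

```scala
// Sketch: per-row allocation through an iterator, plus a long-lived
// aggregation map that survives young-gen collections.
object IteratorChurn {
  final case class Row(key: Int, value: Long) // one heap object per row

  def rows(n: Int): Iterator[Row] =
    Iterator.tabulate(n)(i => Row(i % 10, i.toLong)) // allocates a Row per element

  def main(args: Array[String]): Unit = {
    // Mid-lived object: grows for the whole task, so young-gen GCs that run
    // while it is alive can promote it to the old gen.
    val agg = scala.collection.mutable.Map.empty[Int, Long].withDefaultValue(0L)
    for (r <- rows(100000)) agg(r.key) = agg(r.key) + r.value
    assert(agg.size == 10)
  }
}
```

Code generation over an unsafe (off-heap or flat binary) row layout avoids the per-row `Row` allocation entirely, which is the point of the Tungsten path.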

So: the more garbage that is created, the more frequent full GCs become.

And the more long-lived objects there are in the old gen (e.g. cached data),
the more expensive each full GC is.

On Tue, Aug 25, 2015 at 5:19 PM, Ulanov, Alexander <> wrote:

> Thank you for the explanation. The size of the 100M data is ~1.4GB in
> memory and each worker has 32GB of memory. There seems to be a lot of free
> memory available. I wonder how Spark can hit GC pressure with such a setup?
> Reynold Xin <>
> On Fri, Aug 21, 2015 at 11:07 AM, Ulanov, Alexander <> wrote:
> It seems that there is a nice improvement with Tungsten enabled given that
> data is persisted in memory: 2x and 3x. However, the improvement is not as
> large for parquet: 1.5x. Interestingly, with Tungsten enabled, the
> performance of in-memory and parquet data aggregation is similar. Could
> anyone comment on this? It seems counterintuitive to me.
> Local performance was not as good as Reynold's: I get around 1.5x, he had
> 5x. However, local mode is not that interesting.
> I think a large part of that is coming from the pressure created by JVM
> GC. Putting more data in-memory makes GC worse, unless GC is well tuned.
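For reference, the tuning alluded to above is usually done through the executor JVM options. A hedged sketch (these are standard HotSpot flags passed via Spark's `spark.executor.extraJavaOptions` setting; the application jar name is hypothetical and exact values depend on the workload):

```shell
# Sketch: enlarge the young generation so short-lived row objects die there
# instead of being promoted to the old gen, and log GC activity to verify.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:NewRatio=1 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  your-app.jar
```

The GC log then shows whether full GCs actually stop once the young gen is large enough to absorb the per-row garbage.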
