spark-user mailing list archives

From Yanbo Liang <>
Subject Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge
Date Thu, 31 Dec 2015 03:00:08 GMT
Hi Jia,

You can try inputRDD.persist(MEMORY_AND_DISK) and check whether it gives
stable performance. With MEMORY_AND_DISK, partitions that don't fit in
memory are stored on disk and read from there when they are needed. With
MEMORY_ONLY, partitions that don't fit are not cached and are recomputed
each time they are needed.
Also, you don't need such a large driver memory in your case, because
KMeans uses little driver memory unless k is very large.
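
As a rough sketch of the suggestion above (not your actual job): the input
path argument, the space-separated line format, and the values of k and
maxIterations are assumptions for illustration only. It uses the MLlib
RDD-based KMeans API and StorageLevel.MEMORY_AND_DISK.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

object KMeansPersistSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KMeansPersistSketch")
    val sc = new SparkContext(conf)

    // Assumed input format: one point per line, space-separated doubles.
    // args(0) is a hypothetical path to the input data.
    val inputRDD = sc.textFile(args(0))
      .map(line => Vectors.dense(line.trim.split(' ').map(_.toDouble)))

    // Partitions that don't fit in memory go to disk instead of being
    // dropped and recomputed, as they would be with MEMORY_ONLY.
    inputRDD.persist(StorageLevel.MEMORY_AND_DISK)

    // Illustrative values only; use your own k and iteration count.
    val k = 10
    val maxIterations = 20
    val model = KMeans.train(inputRDD, k, maxIterations)

    println(s"Within-set sum of squared errors: ${model.computeCost(inputRDD)}")

    inputRDD.unpersist()
    sc.stop()
  }
}
```

When submitting, the driver and executor sizes can be set separately with
the --driver-memory and --executor-memory options of spark-submit, so the
driver can be given much less than the executor.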


2015-12-30 22:20 GMT+08:00 Jia Zou <>:

> I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 CPU
> cores and 30GB memory. Executor memory is set to 15GB, and driver memory is
> set to 15GB.
> My observation is that when the input data is smaller than 15GB,
> performance is quite stable. However, when the input data becomes larger
> than that, performance is extremely unpredictable. For example, for
> 15GB input with inputRDD.persist(MEMORY_ONLY), I got three dramatically
> different results: 27 mins, 61 mins and 114 mins. (All settings are the
> same for the 3 tests, and I create the input data immediately before
> running each test to keep the OS buffer cache hot.)
> Can anyone help explain this? Thanks very much!
