spark-dev mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: ALS memory limits
Date Wed, 26 Mar 2014 08:45:03 GMT
Much of this sounds related to the memory issue mentioned earlier in this
thread. Are you using a build that includes that fix? That would be by far
the most important thing to check here.

If the raw memory requirement is 8 GB, the actual heap size necessary could
be a lot larger -- object overhead, everything else resident in memory,
overhead within the heap allocation, and so on. So I would expect the total
memory requirement to be significantly more than 9 GB.
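
As a back-of-envelope check, using the numbers from your message below
(20M users, 1M items, rank 50, 8-byte doubles, 10 workers) -- note the
overhead factor here is my guess, not a measurement:

    // Rough estimate of raw ALS factor storage vs. likely heap need.
    // All inputs come from this thread except overheadFactor, which is
    // a guess at JVM object/boxing/copy overhead.
    val users = 20000000L
    val items = 1000000L
    val rank = 50
    val workers = 10
    val rawGB = (users + items) * rank * 8L / math.pow(1024, 3)
    println(f"raw factors:    $rawGB%.1f GB")              // ~7.8 GB
    val overheadFactor = 3.5
    println(f"plausible heap: ${rawGB * overheadFactor}%.1f GB total")
    println(f"per worker:     ${rawGB * overheadFactor / workers}%.1f GB")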

Still, this is the *total* requirement across the cluster. Each worker is
just loading part of the matrix. If you have 10 workers I would imagine it
roughly chops the per-worker memory requirement by 10x.

This in turn depends on also letting workers use more than their default
amount of memory; you may need to increase executor memory here.
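
Something along these lines, for example (a sketch against the SparkConf
API; "32g" just mirrors the per-worker figure you mention, not a
recommendation):

    import org.apache.spark.{SparkConf, SparkContext}

    // Raise executor memory above the default before creating the
    // context; tune "32g" to what your nodes actually have.
    val conf = new SparkConf()
      .setAppName("ALS")
      .set("spark.executor.memory", "32g")
    val sc = new SparkContext(conf)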

Separately, I have observed issues with too many open files and lots of
files in /tmp. You may have to use ulimit to increase the number of open
files allowed.

On Wed, Mar 26, 2014 at 6:06 AM, Debasish Das <debasish.das83@gmail.com>wrote:

> Hi,
>
> For our use cases we are looking at 20M x 1M matrices, which is in a
> similar range to what is outlined in the paper here:
>
> http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html
>
> Does the exponential runtime growth in Spark ALS outlined by the blog
> still exist in recommendation.ALS?
>
> I am running a Spark cluster of 10 nodes with around 1 TB of total memory
> and 80 cores....
>
> With rank = 50, the memory requirement for ALS should be 20M x 50 doubles
> on every worker, which is around 8 GB....
>
> Even if both factor matrices are cached in memory I should be bounded by
> ~9 GB, but even with 32 GB per worker I see GC errors...
>
> I am debugging the scalability and memory requirements of the algorithm
> further, but any insights would be very helpful...
>
> Also there are two other issues:
>
> 1. If GC errors are hit, that worker JVM goes down and I have to restart
> it manually. Is this expected?
>
> 2. When I try to make use of all 80 cores on the cluster, I get
> java.io.FileNotFoundException on /tmp/. Is there some OS limit on how many
> cores can simultaneously access /tmp from a process?
>
> Thanks.
> Deb
>
>
