spark-user mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: MLLib ALS question
Date Tue, 30 Sep 2014 19:01:08 GMT
You may need a cluster with more memory. The current ALS
implementation constructs all subproblems in memory. With rank=10,
that means (6.5M + 2.5M) * 10^2 / 2 * 8 bytes = 3.6GB. The ratings
need about 2GB, not counting the overhead. ALS creates in/out blocks to
optimize the computation, which take about twice as much memory as the
original dataset. Note that this optimization becomes "overhead" on a
single machine. All these factors contribute to the OOM error.
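
As a rough check of the arithmetic above, here is a minimal sketch in Python. The function name and the 16 bytes per rating (two 4-byte ints plus one 8-byte double, with no object overhead) are assumptions made for illustration, and the in/out blocks are simply taken as twice the raw ratings, following the estimate in this message.

```python
def als_memory_estimate(num_users, num_items, num_ratings, rank,
                        bytes_per_double=8, bytes_per_rating=16):
    """Back-of-the-envelope ALS memory estimate in bytes (hypothetical helper)."""
    # One subproblem per user and per item, each holding roughly
    # rank^2 / 2 doubles (half of a symmetric rank x rank matrix).
    subproblems = (num_users + num_items) * rank ** 2 / 2 * bytes_per_double
    # Raw ratings: assumed 16 bytes each, JVM object overhead not counted.
    ratings = num_ratings * bytes_per_rating
    # In/out blocks: taken as about twice the original dataset.
    in_out_blocks = 2 * ratings
    return subproblems, ratings, in_out_blocks


# Figures from this thread: 6.5M users, 2.5M products, 120M ratings, rank=10.
subproblems, ratings, in_out_blocks = als_memory_estimate(
    6_500_000, 2_500_000, 120_000_000, rank=10)

for name, size in [("subproblems", subproblems),
                   ("ratings", ratings),
                   ("in/out blocks", in_out_blocks)]:
    print(f"{name}: {size / 1e9:.2f} GB")
```

Under these assumptions the three terms add up to well over half of a 16 GB machine before any JVM or serialization overhead is counted.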

You can try DISK_ONLY with spark.rdd.compress set to true. In Spark
1.1, we added an option to set the storage level for in/out blocks
(ALS.setIntermediateRDDStorageLevel), which you can use to store
in/out blocks on disk. That being said, I still recommend running the
dataset on a cluster with more memory.
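
A hedged sketch of these settings in Python follows. Note the substitution: the setter named above belongs to the Scala MLlib ALS class, while this sketch uses the DataFrame-based pyspark.ml.recommendation.ALS from later Spark releases, whose intermediateStorageLevel parameter plays the corresponding role. The names SPARK_CONF and build_disk_backed_als are hypothetical, and the column names are the estimator's defaults.

```python
# Configuration suggested in this message, as Spark key/value pairs.
SPARK_CONF = {"spark.rdd.compress": "true"}


def build_disk_backed_als(rank=10):
    """Return an ALS estimator that keeps its in/out blocks on disk."""
    # Imported lazily so the settings above can be read without Spark installed.
    from pyspark.ml.recommendation import ALS

    return ALS(
        rank=rank,
        userCol="user",
        itemCol="item",
        ratingCol="rating",
        # Counterpart of setIntermediateRDDStorageLevel in this API (assumption
        # that the DataFrame-based API is acceptable for the job at hand).
        intermediateStorageLevel="DISK_ONLY",
    )
```

Each entry of SPARK_CONF would be passed to the session builder (for example with SparkSession.builder.config(key, value)) before fitting the estimator on a DataFrame of ratings.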

Best,
Xiangrui

On Tue, Sep 30, 2014 at 10:44 AM, Alex T <chiortster@gmail.com> wrote:
> Hi,
> I'm trying to use Matrix Factorization over a dataset with about 6.5M users,
> 2.5M products and 120M ratings. The test runs in standalone
> mode with a single worker (quad-core, 16 GB RAM).
>
> The program runs out of memory, and I think that this happens because
> flatMap holds data in memory.
> (I tried the MovieLens dataset, which has 65k users, 11k movies and 100M
> ratings, and the test ran without any problem.)
>
> Is there any way to make ALS hold the data on disk, instead of memory?
>
> When I was trying the MovieLens dataset, I noticed that after all the jobs
> finished, the program still held some residual RDDs in memory. Why is that?
>
> One last, general question: when I persist an RDD with
> StorageLevel.DISK_ONLY, why does the Unix system monitor show Apache Spark
> using the same amount of RAM as when I persist it in memory?
>
> Thanks in advance. I hope this is understandable, since English is not my
> main language.
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-ALS-question-tp15420.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
