spark-user mailing list archives

From Roshani Nagmote <>
Subject Re: Spark MLlib ALS algorithm
Date Sun, 25 Sep 2016 23:19:39 GMT

I ran the ALS algorithm on 30 c4.8xlarge machines (60 GB RAM each) with the
Netflix dataset (1.4 GB; Users: 480189, Items: 17770, Ratings: 99M).

The *command* I ran:

/usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn \
  --jars /usr/lib/spark/examples/jars/scopt_2.11-3.3.0.jar \
  netflixals_2.11-1.0.jar \
  --rank 200 --numIterations 30 --lambda 5e-3 --kryo s3://netflix_train

I get the following *error*:

Job aborted due to stage failure: Task 625 in stage 28.0 failed 4 times,
most recent failure: Lost task 625.3 in stage 28.0 (TID 9362, ip.ec2):
(No space left on device)

I set the checkpoint directory to a location in S3 and used a checkpoint
interval of 5. The dataset is very small, so I don't understand why the job
runs out of space on a 30-node Spark EMR cluster.
Can anyone please help me with this?


On Fri, Sep 23, 2016 at 11:50 PM, Nick Pentreath <> wrote:

> The scale factor was only to scale up the number of ratings in the dataset
> for performance testing purposes, to illustrate the scalability of Spark
> ALS.
> It is not something you would normally do on your training dataset.
> On Fri, 23 Sep 2016 at 20:07, Roshani Nagmote <>
> wrote:
>> Hello,
>> I was working on Spark MLlib ALS Matrix factorization algorithm and came
>> across the following blog post:
>> collaborative-filtering-with-spark-mllib.html
>> Can anyone help me understand what the "s" scaling factor does, and does it
>> really give better performance? What's the significance of this?
>> If we convert the input data to scaledData with the help of "s", will it
>> speed up the algorithm?
>> Scaled data usage:
>> *(For each user, we create pseudo-users that have the same ratings. That
>> is, for every rating as (userId, productId, rating), we generate (userId+i,
>> productId, rating) where 0 <= i < s and s is the scaling factor)*
>> Also, this blog post is for Spark 1.1, and I am currently using 2.0.
>> Any help will be greatly appreciated.
>> Thanks,
>> Roshani
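
For illustration, the pseudo-user scaling described in the quoted question can be sketched in plain Python. This is only a minimal sketch, not the blog post's actual code: the function name scale_ratings and the sample ratings are hypothetical, and Spark is left out so the transformation itself is easy to see. The output holds s times as many ratings as the input, which matches the reply above: the factor inflates the dataset for performance testing and does not make training faster. The sketch follows the formula literally, so with densely numbered user IDs a generated userId+i can coincide with a real user's ID; that is not handled here.

```python
from typing import Iterable, Iterator, Tuple

# A rating is (userId, productId, rating), as in the quoted description.
Rating = Tuple[int, int, float]


def scale_ratings(ratings: Iterable[Rating], s: int) -> Iterator[Rating]:
    """Yield (userId + i, productId, rating) for every rating and 0 <= i < s.

    Hypothetical helper: it applies the quoted formula literally and does
    not guard against generated IDs colliding with existing user IDs.
    """
    if s < 1:
        raise ValueError("s must be at least 1")
    for user_id, product_id, rating in ratings:
        for i in range(s):
            yield (user_id + i, product_id, rating)


# Made-up sample data: two ratings from two users.
ratings = [(10, 1, 5.0), (20, 2, 3.0)]
scaled = list(scale_ratings(ratings, s=3))

print(len(scaled))  # 6: each input rating is emitted s = 3 times
```

With s = 1 the output equals the input, so the unscaled dataset is the special case of this transformation.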
