mahout-user mailing list archives

From Pat Ferrel <pat.fer...@gmail.com>
Subject Re: Mahout 1.0: parallelism/number tasks during SimilarityAnalysis.rowSimilarity
Date Sat, 15 Nov 2014 17:33:01 GMT
I’ll add a new option to escape any spark options and put them directly into the SparkConf
for the job before the context is created.

The CLI will be something like -D xxx=yyy, so for this case you can change the default parallelism
with

-D spark.default.parallelism=400
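
For illustration, a minimal sketch of what the driver side could do with the escaped values once they are parsed, assuming they arrive as plain key/value pairs. The object name, app name, and hard-coded map below are placeholders, not Mahout's actual driver code; SparkConf.set, setAppName, setMaster, and the SparkContext constructor are standard Spark API:

import org.apache.spark.{SparkConf, SparkContext}

object EscapedOptionsSketch {
  def main(args: Array[String]): Unit = {
    // pretend these came in as repeated "-D key=value" CLI arguments (hypothetical parsing)
    val escaped = Map("spark.default.parallelism" -> "400")

    val conf = new SparkConf()
      .setAppName("rowSimilarity")   // placeholder app name
      .setMaster("local[7]")

    // copy every escaped option into the SparkConf before the context is created
    escaped.foreach { case (k, v) => conf.set(k, v) }

    val sc = new SparkContext(conf)
    println(sc.defaultParallelism)   // should print 400
    sc.stop()
  }
}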

If the logic holds that you can often run 8 to 16x as many tasks as you have cores, then running
locally on my laptop with local[7] would mean -D spark.default.parallelism=56 or 112

If you want this value set for your entire cluster you should be able to set it in the conf
files when you launch the cluster. We don’t change any of those values in the client except
spark.executor.memory (only if specified) and any escaped values. 
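
For a cluster-wide default, a hedged example (assuming a Spark 1.x deployment that reads conf/spark-defaults.conf) would be a line like

spark.default.parallelism    400

in conf/spark-defaults.conf on the machine you submit from; a value set explicitly on the SparkConf, such as the escaped -D form above, should still take precedence per job.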

On Oct 13, 2014, at 11:32 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

On Mon, Oct 13, 2014 at 12:32 PM, Reinis Vicups <mahout@orbit-x.de> wrote:

> 
>> Do you think that simply increasing this parameter is a safe and sane
>> thing to do?
> 
> Why would it be unsafe?
> 
> In my own implementation I am using 400 tasks on my 4-node, 2-CPU cluster,
> and the execution times of the largest shuffle stage have dropped by about
> 10x.
> I have a number of test values from the time when I used the "old"
> RowSimilarityJob, and with some exceptions (I guess due to randomized
> sparsification) I still get approximately the same values with my own row
> similarity implementation.
> 

Splitting things too far can make processes much less efficient.  Setting
parameters like this may propagate further than desired.

I asked because I don't know, however.

