Subject: Re: Mahout 1.0: parallelism/number tasks during
 SimilarityAnalysis.rowSimilarity
From: Pat Ferrel <pat.ferrel@gmail.com>
Date: Sat, 15 Nov 2014 09:33:01 -0800
To: "user@mahout.apache.org" <user@mahout.apache.org>

I’ll add a new option to escape any Spark options and put them
directly into the SparkConf for the job before the context is created.

The CLI will be something like -D xxx=yyy, so for this case you can
change the default parallelism with

-D spark.default.parallelism=400
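
As a rough illustration of the escaping idea (not the actual Mahout driver code, which is Scala and is not shown in this thread), here is a minimal Python sketch that pulls `-D key=value` pairs out of an argument list. The function name `parse_escaped_options` and the `--input` flag are hypothetical and exist only for this example.

```python
def parse_escaped_options(args):
    """Collect `-D key=value` pairs from a CLI argument list into a dict.

    Hypothetical helper for illustration; the real driver may parse differently.
    """
    conf = {}
    i = 0
    while i < len(args):
        if args[i] == "-D" and i + 1 < len(args):
            # Split on the first "=" only, so values may themselves contain "=".
            key, sep, value = args[i + 1].partition("=")
            if not sep or not key:
                raise ValueError(
                    "expected key=value after -D, got %r" % args[i + 1]
                )
            conf[key] = value
            i += 2
        else:
            # Anything that is not an escaped option is left for the normal parser.
            i += 1
    return conf


# "--input data.csv" is an illustrative, made-up option.
spark_conf = parse_escaped_options(
    ["--input", "data.csv", "-D", "spark.default.parallelism=400"]
)
print(spark_conf)  # {'spark.default.parallelism': '400'}
```

The collected pairs would then be copied into the SparkConf before the context is created, which is the point at which Spark reads them.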

If the logic holds that you can often use 8 to 16 x your number of
cores, then running locally on my laptop with local[7] should use
-D spark.default.parallelism=56 (8 x 7) or 112 (16 x 7).
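
The arithmetic behind those two numbers, as a small Python sketch. The 8x and 16x factors are only the rule of thumb mentioned above, not a Spark-defined constant, and `parallelism_range` is a hypothetical name.

```python
def parallelism_range(cores, low_factor=8, high_factor=16):
    """Return (low, high) candidate values for spark.default.parallelism.

    The factors are a rule-of-thumb assumption, not a Spark default.
    """
    return cores * low_factor, cores * high_factor


# local[7] means 7 worker threads on the local machine.
low, high = parallelism_range(7)
print(low, high)  # 56 112
```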

If you want this value set for your entire cluster you should be able to
set it in the conf files when you launch the cluster. We don’t
change any of those values in the client except spark.executor.memory
(only if specified) and any escaped values.

On Oct 13, 2014, at 11:32 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

On Mon, Oct 13, 2014 at 12:32 PM, Reinis Vicups <mahout@orbit-x.de> wrote:

>> Do you think that simply increasing this parameter is a safe and sane
>> thing to do?
>
> Why would it be unsafe?
>
> In my own implementation I am using 400 tasks on my 4-node-2cpu cluster
> and the execution times of the largest shuffle stage have dropped around
> 10 times.
> I have a number of test values back from the time when I used the "old"
> RowSimilarityJob and, with some exceptions (I guess due to randomized
> sparsification), I still have approx. the same values with my own row
> similarity implementation.
>

Splitting things too far can make processes much less efficient. Setting
parameters like this may propagate further than desired.

I asked because I don't know, however.