From: Pat Ferrel
To: "user@mahout.apache.org"
Subject: Re: Mahout 1.0: parallelism/number tasks during SimilarityAnalysis.rowSimilarity
Date: Sat, 15 Nov 2014 09:33:01 -0800

I’ll add a new option to escape any spark options and put them directly into the SparkConf for the job before the context is created. The CLI will be something like -D xxx=yyy so for this case you can change the default parallelism with

-D spark.default.parallelism=400

If the logic holds that you can often have 16 to 8 x your number of cores then running locally on my laptop with local[7] should have -D spark.default.parallelism=112 or 56

If you want this value set for your entire cluster you should be able to set it in the conf files when you launch the cluster. We don’t change any of those values in the client except spark.executor.memory (only if specified) and any escaped values.

On Oct 13, 2014, at 11:32 AM, Ted Dunning wrote:

On Mon, Oct 13, 2014 at 12:32 PM, Reinis Vicups wrote:

>
>> Do you think that simply increasing this parameter is a safe and sane
>> thing to do?
>>
>
> Why would it be unsafe?
>
> In my own implementation I am using 400 tasks on my 4-node-2cpu cluster
> and the execution times of the largest shuffle stage have dropped around 10
> times.
> I have a number of test values back from the time when I used "old"
> RowSimilarityJob and with some exceptions (I guess due to randomized
> sparsification) I still have approx. the same values with my own row
> similarity implementation.
>

Splitting things too far can make processes much less efficient. Setting parameters like this may propagate further than desired.

I asked because I don't know, however.
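
For reference, the -D escape option and the cores-times-multiplier rule of thumb from the top of this message can be sketched roughly as follows. This is only an illustration in plain Python, not the Mahout driver code: the function names parse_escaped_options and suggested_parallelism are made up for this example, and the final CLI syntax may differ from "-D key=value". In a real job the collected pairs would be handed to SparkConf.set(key, value) before the context is created.

```python
# Hypothetical sketch, not Mahout's actual option parser.

def parse_escaped_options(argv):
    """Collect `-D key=value` pairs from an argument list into a dict.

    All other arguments are ignored. Values stay strings, which is how
    Spark configuration values are stored.
    """
    conf = {}
    i = 0
    while i < len(argv):
        if argv[i] == "-D" and i + 1 < len(argv):
            key, sep, value = argv[i + 1].partition("=")
            if not sep or not key:
                raise ValueError(
                    "expected key=value after -D, got %r" % argv[i + 1]
                )
            conf[key] = value
            i += 2  # skip the flag and its key=value argument
        else:
            i += 1
    return conf


def suggested_parallelism(cores, low=8, high=16):
    """Return (cores * low, cores * high) for the 8x to 16x rule of thumb."""
    return cores * low, cores * high


if __name__ == "__main__":
    args = ["--master", "local[7]", "-D", "spark.default.parallelism=112"]
    print(parse_escaped_options(args))
    print(suggested_parallelism(7))
```

With local[7] this gives the same 56 and 112 quoted above. Whether a value in that range, or something larger like the 400 tasks Reinis used, works best will depend on the data and the cluster.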