Date: Sun, 29 Nov 2015 20:58:58 -0800
Subject: Re: using spark-submit to launch CLI jobs
From: Dmitriy Lyubimov
To: "dev@mahout.apache.org"

On Sat, Nov 28, 2015 at 10:55 AM, Pat Ferrel wrote:

> I use spark-submit also to launch apps that use Mahout so not sure what
> assumptions you are talking about.

OK, so if it works, what's the problem? I am lost. I am talking about the
assumption that anything dealing with context needs to be changed or even
removed.

> The first thing is to use spark-submit in our own launch script.

What script would that be?

> The current code calls the CLI mahout script to get classpath info, this
> should be passed in to the

Which code?
Mahout context creation? As I said, you can customize that behavior. You can
tell it not to look for the standard jars and get your own jars onto the
classpath. It should be flexible enough to handle any startup situation.

> spark-submit so if we launch with spark-submit I think the call of the
> mahout script would be unnecessary. This makes it more straightforward to
> use with Yarn cluster mode where the client/driver is launched on some
> cluster machine where there would be no script to call.

Again, see the comment above. Yes, I have done submits to YARN and
standalone, you name it. It is all good.

> If the SparkMahoutContext is a hard requirement that’s fine.

Every single operation uses a context (which essentially wraps the backend
context). It is not passed in; it is implied by a dataset parameter. No
physical operator can work without it. For the most part, the context is
required because the backend engines require a session equivalent of it
(SparkContext in Spark's case). This is more a hard requirement on the
backend's part.

> As I said, I don’t understand all of those ramifications.
>
> On Nov 27, 2015, at 8:25 PM, Dmitriy Lyubimov wrote:
>
> I do submits all the time, don't see any problem. It is part of my standard
> stress test harness.
>
> Mahout context is conceptual and cannot be removed, nor is it required to
> be removed in order to run submitted jobs. Submission and contexts are two
> completely separate concepts. One can submit a job that, for example,
> doesn't set up a Spark job at all and runs, for example, an MR job, or just
> manipulates some HDFS directories, or sets up multiple jobs, or combinations
> of all of the above. All submission means is sending an uber jar to an
> application server and launching a main class there, instead of doing the
> same locally. Not sure where all these assumptions are coming from.
> On Nov 27, 2015 11:33 AM, "Pat Ferrel" wrote:
>
> > Currently we create a SparkMahoutContext, and use “mahout -spark
> > classpath” to create the SparkContext. The SparkConf is also directly
> > accessed. If we move to using spark-submit for launching the Mahout Shell
> > and other drivers we would need to refactor some of this and change the
> > mahout script. It seems desirable to have any driver code create the
> > Spark context and rely on spark-submit for any config overrides and
> > params. This implies the possible removal (not sure about this) of
> > SparkMahoutContext. In general it would be nice if this were done outside
> > of Mahout, or limited to the drivers and shell. Mahout has become a
> > library that is designed to be backend independent. This code was
> > designed before this became a goal, and it is beyond my understanding to
> > fully grasp how much work would be involved and what would replace it.
> >
> > The code refactoring needed is not well understood, by me at least. But
> > intuition says that with a growing number of backends it might be good to
> > clean up the Spark dependencies for context management. This has also
> > been a bit of a problem in creating apps that use Mahout since typical
> > spark-submit use cannot be relied on to make config changes; they must be
> > made in environment variables only. This arguably non-standard
> > manipulation of the context puts limitations and hidden assumptions into
> > using Mahout as a library.
> >
> > Doing all of this implies a fairly large bit of work, I think. The
> > benefit is that it will be more clear how to use Mahout as a library and
> > in cleaning up some unneeded code. I’m not sure I have enough time to do
> > all of this myself.
> >
> > This isn’t so much a proposal as a call for discussion.
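
For readers of the archive, the two points argued in the reply (the context is implied by the dataset rather than passed to each operator, and submission is only a way of launching a main class) can be illustrated with a toy model. This is a sketch in plain Python, not Mahout or Spark code: `BackendSession`, `DistributedContext`, `Dataset`, `parallelize`, `scale` and `main` are all hypothetical names invented for the illustration.

```python
from dataclasses import dataclass


class BackendSession:
    """Hypothetical stand-in for an engine session (e.g. a SparkContext)."""

    def __init__(self, name):
        self.name = name
        self.log = []  # names of the physical operators run on this session

    def run(self, op_name, fn, rows):
        # Every physical operator has to go through the backend session.
        self.log.append(op_name)
        return [fn(r) for r in rows]


@dataclass(frozen=True)
class DistributedContext:
    """Hypothetical engine-neutral wrapper around the backend session."""
    backend: BackendSession


@dataclass(frozen=True)
class Dataset:
    """A dataset remembers the context it was created in."""
    ctx: DistributedContext
    rows: tuple


def parallelize(ctx, rows):
    # The only place where the context is handed over explicitly.
    return Dataset(ctx, tuple(rows))


def scale(ds, k):
    # The operator takes only the dataset; the context is implied by it.
    out = ds.ctx.backend.run("scale", lambda r: r * k, ds.rows)
    return Dataset(ds.ctx, tuple(out))


def main(args):
    # The entry point is the same whether it is started locally or by a
    # submit tool; creating the context happens inside it either way.
    master = args[0] if args else "local"
    ctx = DistributedContext(BackendSession(master))
    return scale(parallelize(ctx, [1, 2, 3]), 10)
```

Calling `main(["local"])` and calling the same `main` from whatever process a submit tool starts differ only in who invokes it; in this model the context is still created inside `main`, and `scale` still finds it through the dataset.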