Subject: Re: Spark on Mesos / Executor Memory
From: Bharath Ravi Kumar
To: Tim Chen, user@spark.apache.org
Cc: user@mesos.apache.org
Date: Sat, 17 Oct 2015 11:17:38 +0530

Can someone respond if you're aware of the reason for such a memory footprint? It seems unintuitive and hard to reason about.

Thanks,
Bharath

On Thu, Oct 15, 2015 at 12:29 PM, Bharath Ravi Kumar wrote:

> Resending since user@mesos bounced earlier. My apologies.
>
> On Thu, Oct 15, 2015 at 12:19 PM, Bharath Ravi Kumar wrote:
>
>> (Reviving this thread since I ran into similar issues...)
>>
>> I'm running two Spark jobs (in Mesos fine-grained mode), each belonging to a different Mesos role, say low and high. The low:high Mesos weights are 1:10. As expected, the low priority job occupies cluster resources to the maximum extent when running alone. However, when the high priority job is submitted, it does not start and continues to await cluster resources (as seen in the logs). Since the jobs run in fine-grained mode and the low priority tasks begin to finish, the high priority job should ideally be able to start and gradually take over cluster resources as per the weights. However, I noticed that while the "low" job gives up CPU cores with each completing task (e.g. a reduction from 72 -> 12 with default parallelism set to 72), the memory resources are held on to (~500G out of 768G). The spark.executor.memory setting appears to directly determine the amount of memory that the job holds on to. In this case, it was set to 200G in the low priority job and 100G in the high priority job. The nature of these jobs is such that setting the numbers to smaller values (say 32g) resulted in job failures with OutOfMemoryError. It appears that the Spark framework is retaining memory (across tasks) proportional to spark.executor.memory for the duration of the job and not releasing it as tasks complete. This defeats the purpose of fine-grained mode execution, as the memory occupancy is preventing the high priority job from accepting the prioritized CPU offers and beginning execution. Can this be explained / documented better, please?
>>
>> Thanks,
>> Bharath
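For concreteness, the setup described above corresponds roughly to a configuration like the sketch below. The master URL, app name and values are illustrative rather than taken from the actual jobs, and spark.mesos.role assumes a Spark version (1.5+) that supports per-framework Mesos roles:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of the "low" job's submission (illustrative values only).
    val conf = new SparkConf()
      .setMaster("mesos://zk://zk1:2181/mesos")  // placeholder Mesos master URL
      .setAppName("low-priority-job")
      .set("spark.mesos.coarse", "false")        // fine-grained mode
      .set("spark.mesos.role", "low")            // register under the "low" Mesos role
      .set("spark.executor.memory", "200g")      // reserved per executor for the job's lifetime
      .set("spark.default.parallelism", "72")
    val sc = new SparkContext(conf)

The high priority job would differ only in its role ("high") and spark.executor.memory ("100g"); the observation above is that those 200g/100g reservations persist even as individual fine-grained tasks complete and hand their CPU shares back.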
>> On Sat, Apr 11, 2015 at 10:59 PM, Tim Chen wrote:
>>
>>> (Adding spark user list)
>>>
>>> Hi Tom,
>>>
>>> If I understand correctly, you're saying that you're running into memory problems because the scheduler is allocating too many CPUs and not enough memory to accommodate them, right?
>>>
>>> In the case of fine-grained mode I don't think that's a problem, since we have a fixed amount of CPU and memory per task. However, in coarse-grained mode you can run into that problem if you're within the spark.cores.max limit and memory is a fixed number.
>>>
>>> I have a patch out to configure the max CPUs a coarse-grained executor should use, and it also allows multiple executors in coarse-grained mode. So you could, say, try to launch multiple executors of at most 4 cores each, each with spark.executor.memory (+ overhead, etc.), on a slave. (https://github.com/apache/spark/pull/4027)
>>>
>>> It also might be interesting to include a cores-to-memory multiplier, so that with a larger number of cores we scale the memory by some factor, but I'm not entirely sure that's intuitive to use or that people would know what to set it to, as that can likely change with different workloads.
>>>
>>> Tim
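To make the multiplier idea concrete, here is a sketch of what such a sizing rule could look like; none of these names exist in Spark today, and the numbers are purely illustrative:

    // Hypothetical cores-to-memory multiplier for sizing a coarse-grained executor
    // against a resource offer (not an existing Spark or Mesos setting).
    val baseOverheadMb  = 2048   // fixed per-executor overhead
    val memoryPerCoreMb = 4096   // the multiplier being discussed

    def executorMemoryMb(offeredCores: Int): Int =
      baseOverheadMb + memoryPerCoreMb * offeredCores

    println(executorMemoryMb(8))  // 34816 MB for an 8-core offer
    println(executorMemoryMb(2))  // 10240 MB for a 2-core offer

Whether a single factor like memoryPerCoreMb holds across workloads is exactly the doubt raised above.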
>>> On Sat, Apr 11, 2015 at 9:51 AM, Tom Arnfeld wrote:
>>>
>>>> We're running Spark 1.3.0 (with a couple of patches over the top for docker related bits).
>>>>
>>>> I don't think SPARK-4158 is related to what we're seeing; things do run fine on the cluster, given a ridiculously large executor memory configuration. As for SPARK-3535, although that looks useful, I think we're seeing something else.
>>>>
>>>> Put a different way, the amount of memory required at any given time by the Spark JVM process is directly proportional to the amount of CPU it has, because more CPU means more tasks and more tasks means more memory. Even if we're using coarse mode, the amount of executor memory should be proportional to the number of CPUs in the offer.
>>>>
>>>> On 11 April 2015 at 17:39, Brenden Matthews wrote:
>>>>
>>>>> I ran into some issues with it a while ago, and submitted a couple of PRs to fix it:
>>>>>
>>>>> https://github.com/apache/spark/pull/2401
>>>>> https://github.com/apache/spark/pull/3024
>>>>>
>>>>> Do these look relevant? What version of Spark are you running?
>>>>>
>>>>> On Sat, Apr 11, 2015 at 9:33 AM, Tom Arnfeld wrote:
>>>>>
>>>>>> Hey,
>>>>>>
>>>>>> Not sure whether it's best to ask this on the Spark mailing list or the Mesos one, so I'll try here first :-)
>>>>>>
>>>>>> I'm having a bit of trouble with out of memory errors in my Spark jobs... it seems fairly odd to me that memory resources can only be set at the executor level, and not also at the task level. For example, as far as I can tell there's only a *spark.executor.memory* config option.
>>>>>>
>>>>>> Surely the memory requirements of a single executor are quite dramatically influenced by the number of concurrent tasks running? Given a shared cluster, I have no idea what % of an individual slave my executor is going to get, so I basically have to set the executor memory to a value that's correct when the whole machine is in use...
>>>>>>
>>>>>> Has anyone else running Spark on Mesos come across this, or maybe someone could correct my understanding of the config options?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Tom.
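Tom's concern can be made concrete with a back-of-envelope calculation: with only a per-executor memory setting, the memory each concurrent task can count on shrinks as the executor picks up more cores from an offer. All numbers below are illustrative:

    // Memory effectively available per concurrent task when only
    // spark.executor.memory is configurable (illustrative numbers).
    val executorMemoryGb = 32.0   // spark.executor.memory
    val executorCores    = 16     // cores the executor happens to receive
    val taskCpus         = 1      // spark.task.cpus
    val concurrentTasks  = executorCores / taskCpus
    val memoryPerTaskGb  = executorMemoryGb / concurrentTasks

    println(memoryPerTaskGb)      // 2.0 GB; halves again if the executor gets 32 cores

Hence the suggestions in the thread to either bound the cores an executor takes or scale its memory with the size of the offer.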