From: Jia Zhan
Date: Mon, 19 Oct 2015 10:22:23 -0700
Subject: Re: In-memory computing and cache() in Spark
To: Sonal Goyal
Cc: user@spark.apache.org

Hi Sonal,

I tried changing spark.executor.memory, but nothing changes. It seems that when I run locally on one machine, the RDD is cached in driver memory instead of executor memory. Here is a related post online:
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-Local-Mode-td22279.html

When I change spark.driver.memory, I can see the cached data change in the web UI. As I mentioned, when I set driver memory to 2G, it shows 6% of the RDD cached; when set to 15G, it shows 48% cached, but runs much slower!

On Sun, Oct 18, 2015 at 10:32 PM, Sonal Goyal wrote:

> Hi Jia,
>
> RDDs are cached on the executor, not on the driver. I am assuming you are
> running locally and haven't changed spark.executor.memory?
>
> Sonal
>
> On Oct 19, 2015 1:58 AM, "Jia Zhan" wrote:
>
> Does anyone have any clue what's going on? Why would caching with 2g of
> memory be much faster than with 15g?
>
> Thanks very much!
>
> On Fri, Oct 16, 2015 at 2:02 PM, Jia Zhan wrote:
>
>> Hi all,
>>
>> I am running Spark locally on one node and sweeping the memory size for
>> performance tuning. The machine has 8 CPUs and 16G of main memory, and
>> the dataset on my local disk is about 10GB. I have several quick
>> questions and would appreciate any comments.
>>
>> 1. Spark performs in-memory computing, but without calling RDD.cache(),
>> will anything be cached in memory at all? My guess is that without
>> RDD.cache(), only a small amount of data is kept in the OS buffer cache,
>> and every iteration of the computation still has to fetch most of the
>> data from disk. Is that right?
>>
>> 2. To evaluate how caching helps with iterative computation, I wrote the
>> simple program below, which consists of one saveAsTextFile() and three
>> reduce() actions/stages. I set "spark.driver.memory" to "15g" and left
>> everything else at the defaults. Then I ran three experiments.
>>
>>     val conf = new SparkConf().setAppName("wordCount")
>>     val sc = new SparkContext(conf)
>>
>>     val input = sc.textFile("/InputFiles")
>>     val words = input.flatMap(line => line.split(" "))
>>                      .map(word => (word, 1))
>>                      .reduceByKey(_ + _)
>>                      .saveAsTextFile("/OutputFiles")
>>
>>     val ITERATIONS = 3
>>     for (i <- 1 to ITERATIONS) {
>>       val totallength = input.filter(line => line.contains("the"))
>>                              .map(s => s.length)
>>                              .reduce((a, b) => a + b)
>>     }
>>
>> (I) The first run: no caching at all. The application finishes in ~12
>> minutes (2.6min + 3.3min + 3.2min + 3.3min).
>>
>> (II) The second run: I modified the code so that the input is cached:
>>
>>     val input = sc.textFile("/InputFiles").cache()
>>
>> The application finishes in ~11 minutes (5.4min + 1.9min + 1.9min +
>> 2.0min). The storage page in the web UI shows 48% of the dataset is
>> cached, which makes sense given the large Java object overhead, and
>> spark.storage.memoryFraction is 0.6 by default.
>>
>> (III) The third run: the same program as the second, but with
>> "spark.driver.memory" changed to "2g". The application finishes in just
>> 3.6 minutes (3.0min + 9s + 9s + 9s)!! And the UI shows only 6% of the
>> data is cached.
>>
>> From these results we can see the reduce stages finish in seconds. How
>> could that happen with only 6% cached? Can anyone explain?
>>
>> I am new to Spark and would appreciate any help with this. Thanks!
>>
>> Jia
>
> --
> Jia Zhan

--
Jia Zhan
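Sonal's point above has a wrinkle in local mode, which this thread hinges on: with a `local[*]` master, the driver and the single executor run in the same JVM, so the block cache lives inside the driver process and is effectively bounded by spark.driver.memory. A hedged sketch of how the two deployments are sized differently (flag names match Spark 1.x as used in this thread; the jar name, class name, and sizes are illustrative, not recommendations):

```shell
# Local mode: one JVM hosts both driver and executor, so the RDD cache is
# bounded by the driver heap; spark.executor.memory has no effect here.
spark-submit \
  --master "local[8]" \
  --driver-memory 15g \
  --conf spark.storage.memoryFraction=0.6 \
  --class WordCount \
  target/wordcount.jar

# Cluster mode: RDD blocks are cached in the executors, so size those
# instead; the driver heap only needs to hold results of actions.
spark-submit \
  --master spark://master:7077 \
  --executor-memory 15g \
  --class WordCount \
  target/wordcount.jar
```

Note that spark.storage.memoryFraction applied to the Spark 1.x static memory manager current at the time of this thread; Spark 1.6 and later use unified memory management (spark.memory.fraction) instead.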