From: Jia Zhan
Date: Mon, 19 Oct 2015 10:22:23 -0700
Subject: Re: In-memory computing and cache() in Spark
To: Sonal Goyal
Cc: user@spark.apache.org

Hi Sonal,

I tried changing spark.executor.memory, but nothing changes. It seems that when I run locally on one machine, the RDD is cached in driver memory instead of executor memory. Here is a related post online:
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-Local-Mode-td22279.html

When I change spark.driver.memory, I can see the cached data change in the web UI. As I mentioned, when I set driver memory to 2G, it shows 6% of the RDD cached; when set to 15G, it shows 48% cached, but runs much slower!

On Sun, Oct 18, 2015 at 10:32 PM, Sonal Goyal wrote:

> Hi Jia,
>
> RDDs are cached on the executor, not on the driver. I am assuming you are
> running locally and haven't changed spark.executor.memory?
>
> Sonal
>
> On Oct 19, 2015 1:58 AM, "Jia Zhan" wrote:
>
> Does anyone have any clue what's going on? Why would caching with 2g of
> memory be much faster than with 15g?
>
> Thanks very much!
>
> On Fri, Oct 16, 2015 at 2:02 PM, Jia Zhan wrote:
>
>> Hi all,
>>
>> I am running Spark locally on one node and sweeping the memory size for
>> performance tuning. The machine has 8 CPUs and 16G of main memory, and
>> the dataset on my local disk is about 10GB. I have several quick
>> questions and would appreciate any comments.
>>
>> 1. Spark performs in-memory computing, but without calling RDD.cache(),
>> will anything be cached in memory at all? My guess is that without
>> RDD.cache(), only a small amount of data is kept in the OS buffer cache,
>> and every iteration of the computation still has to fetch most of the
>> data from disk. Is that right?
>>
>> 2. To evaluate how caching helps with iterative computation, I wrote the
>> simple program below, which consists of one saveAsTextFile() and three
>> reduce() actions/stages. I set "spark.driver.memory" to "15g" and left
>> everything else at the defaults. Then I ran three experiments.
>>
>>     val conf = new SparkConf().setAppName("wordCount")
>>     val sc = new SparkContext(conf)
>>
>>     val input = sc.textFile("/InputFiles")
>>     val words = input.flatMap(line => line.split(" "))
>>                      .map(word => (word, 1))
>>                      .reduceByKey(_ + _)
>>                      .saveAsTextFile("/OutputFiles")
>>
>>     val ITERATIONS = 3
>>     for (i <- 1 to ITERATIONS) {
>>       val totallength = input.filter(line => line.contains("the"))
>>                              .map(s => s.length)
>>                              .reduce((a, b) => a + b)
>>     }
>>
>> (I) The first run: no caching at all. The application finishes in ~12
>> minutes (2.6min + 3.3min + 3.2min + 3.3min).
>>
>> (II) The second run: I modified the code so that the input is cached:
>>
>>     val input = sc.textFile("/InputFiles").cache()
>>
>> The application finishes in ~11 minutes (5.4min + 1.9min + 1.9min +
>> 2.0min). The storage page in the web UI shows 48% of the dataset is
>> cached, which makes sense given the large Java object overhead, and
>> spark.storage.memoryFraction is 0.6 by default.
>>
>> (III) The third run: the same program as the second, but with
>> "spark.driver.memory" changed to "2g". The application finishes in just
>> 3.6 minutes (3.0min + 9s + 9s + 9s)!! And the UI shows only 6% of the
>> data is cached.
>>
>> From these results we can see the reduce stages finish in seconds. How
>> could that happen with only 6% cached? Can anyone explain?
>>
>> I am new to Spark and would appreciate any help with this. Thanks!
>>
>> Jia
>
> --
> Jia Zhan

--
Jia Zhan
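Sonal's point above has a wrinkle in local mode, which this thread hinges on: with a `local[*]` master, the driver and the single executor run in the same JVM, so the block cache lives inside the driver process and is effectively bounded by spark.driver.memory. A hedged sketch of how the two deployments are sized differently (flag names match Spark 1.x as used in this thread; the jar name, class name, and sizes are illustrative, not recommendations):

```shell
# Local mode: one JVM hosts both driver and executor, so the RDD cache is
# bounded by the driver heap; spark.executor.memory has no effect here.
spark-submit \
  --master "local[8]" \
  --driver-memory 15g \
  --conf spark.storage.memoryFraction=0.6 \
  --class WordCount \
  target/wordcount.jar

# Cluster mode: RDD blocks are cached in the executors, so size those
# instead; the driver heap only needs to hold results of actions.
spark-submit \
  --master spark://master:7077 \
  --executor-memory 15g \
  --class WordCount \
  target/wordcount.jar
```

Note that spark.storage.memoryFraction applied to the Spark 1.x static memory manager current at the time of this thread; Spark 1.6 and later use unified memory management (spark.memory.fraction) instead.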