From: Mich Talebzadeh
Date: Thu, 21 Jul 2016 12:56:06 +0100
To: Joaquin Alzola
Cc: "Taotao.Li", Jean Georges Perrin, Sachin Mittal, user@spark.apache.org
Subject: Re: Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

I started putting together a Performance and Tuning guide for Spark, starting from the simplest operations in Local and Standalone modes, but it seems I never have the time to finish it!

This is some of that material; it is in Word and stitched together in a somewhat arbitrary way. Anyway, if you think it is useful, let me know and I will try to finish it :)

Some of the points we have already discussed in this user group, or they are part of the wider available literature. It is aimed at practitioners.

*Introduction*

According to the Spark website, Apache Spark is a fast and general-purpose engine for large-scale data processing. It is written mostly in Scala, and provides APIs for Scala, Java and Python. It is fully compatible with the Hadoop Distributed File System (HDFS); however, it extends Hadoop's core functionality by providing in-memory cluster computation, among other things.

Providing in-memory capability is probably one of the most important aspects of Spark, as it allows computation to be done in memory. Spark also supports an advanced scheduler based on a directed acyclic graph (DAG). These capabilities allow Spark to be used as an advanced query engine with the help of the Spark shell and Spark SQL. For near real-time data processing, Spark Streaming can be used. Another important but often understated capability of Spark is deploying it as an advanced execution engine for other Hadoop tools such as Hive.

Like most tools in the Hadoop ecosystem, Spark will require careful tuning to get the most out of it. Thus, in these brief notes, we will aim to address these points to ensure that you create an infrastructure for Spark that is not only performant but also scalable for your needs.

*Why Spark*

The Hadoop ecosystem is nowadays crowded with a variety of offerings. Some of them are complementary and others compete with each other. Spark is unique in that, in a relatively short space of time, it has grown greatly in popularity and is today one of the most popular tools in the Hadoop ecosystem.

The fundamental technology of Hadoop, using the Map-Reduce algorithm as its core execution engine, gave rise to the deployment of other methods. Although Map-Reduce was and still is an incredible technology, it lacked the speed and performance required for certain business needs, such as dealing with real-time analytics. Spark was developed from the ground up to address these concerns.

*Overview of Spark Architecture*

Spark, much like many other tools, runs a set of instructions summarized in the form of an application. An application consists of a Driver Program that is responsible for submitting, running and monitoring the code. Spark can distribute the workload across what is known as a cluster; a minimal example of such an application is sketched below.
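As an illustration only (a minimal sketch that is not part of the original guide; the object and application names are made up), here is what such an application can look like in Scala. The driver program builds a SparkContext and submits work that Spark then executes in parallel:

import org.apache.spark.{SparkConf, SparkContext}

object MinimalApp {
  def main(args: Array[String]): Unit = {
    // The driver program starts here: it describes the work to be done
    val conf = new SparkConf().setAppName("MinimalApp").setMaster("local[2]")
    val sc   = new SparkContext(conf)

    // parallelize() partitions the data; filter() and count() run as parallel tasks
    val evens = sc.parallelize(1 to 100000).filter(_ % 2 == 0).count()
    println(s"Number of even values: $evens")

    sc.stop()
  }
}

Packaged as a jar and run with spark-submit (or pasted into spark-shell without the object wrapper), this is the driver program; the RDD operations inside it are what Spark farms out as tasks.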
In other words, Spark applications run as independent sets of processes on a cluster. A process here is simply an application running on a UNIX/Linux system.

A *cluster* is a collection of servers, called nodes, that communicate with each other to make a set of services highly available to the applications running on them.

Before going any further, one can equate a node with a physical host, a VM host or any other resource capable of providing RAM and cores. Some refer to nodes as machines as well.

*Spark Operations*

Spark takes advantage of a cluster by dividing the workload across the cluster and executing operations in parallel to speed up the processing. To effect this, Spark provides, *as part of its core architecture*, an abstraction layer called a *Resilient Distributed Dataset (RDD)*. Simply put, an RDD is a selection of elements (like a sequence, a text file, a CSV file, data coming in from streaming sources such as Twitter or Kafka, and so forth) that one wants to work with. What an RDD does is partition that data across the nodes of the Spark cluster to take advantage of parallel processing.

*RDDs in Spark are immutable*, meaning that *they cannot be changed, but they can be acted upon* to create other RDDs and result sets. RDDs can also be cached in memory (to be precise, in the memory allocated to Spark) across multiple nodes for faster parallel operations.

So that takes care of data. There is a second abstraction in Spark known as *shared variables* that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. *Spark supports two types of shared variables*: *broadcast variables*, which can be used to cache a value in memory on all nodes, and *accumulators*, which are variables that are only "added" to, such as counters and sums. We will cover them in more detail later; the short sketch below gives a first flavour.
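A minimal sketch only (not from the original guide; it assumes a SparkContext named sc is already available, as in the earlier example, and the variable names are made up):

    // Broadcast variable: a read-only lookup table cached once per node
    // instead of being shipped with every task
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

    // Accumulator: tasks can only add to it; the driver reads the total
    val misses = sc.accumulator(0L, "lookup misses")

    val codes = sc.parallelize(Seq("a", "b", "c", "a")).map { key =>
      if (!lookup.value.contains(key)) misses += 1L
      lookup.value.getOrElse(key, -1)
    }.collect()

    // Note: for fully reliable counts, accumulators are best updated inside
    // actions (e.g. foreach) rather than transformations, since retried tasks
    // may apply the update more than once
    println(s"codes = ${codes.mkString(",")}, lookup misses = ${misses.value}")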
We have been mentioning Spark clusters, but clusters can be configured differently through what is known as the cluster manager. To this effect, Spark currently supports the following configurations:

- *Spark Local* - Spark runs on the local host. This is the simplest set-up and is best suited for learners who want to understand the different concepts of Spark and for those performing unit testing.

- *Spark Standalone* - a simple cluster manager included with Spark that makes it easy to set up a cluster.

- *YARN Cluster Mode* - the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. This is invoked with --master yarn and --deploy-mode cluster.

- *YARN Client Mode* - the driver runs in the client process, and the application master is only used for requesting resources from YARN. Unlike Spark Standalone mode, in which the master's address is specified in the --master parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration; thus, the --master parameter is yarn. This is invoked with --deploy-mode client.

*Some practical notes on client versus cluster mode:*

*- Client mode requires the process that launched the application to remain alive. This means the host it lives on has to stay alive, and it may not be super-friendly to ssh sessions dying, for example, unless you use nohup.

- Client mode driver logs are printed to stderr by default. Yes, you can change that, but in cluster mode they are all collected by YARN without any user intervention.

- If your edge node (from where the application is launched) is not really part of the cluster (e.g. it lives in an outside network with firewalls or higher latency), you may run into issues.

- In cluster mode, your driver's CPU and memory usage is accounted for in YARN; this matters if your edge node is part of the cluster (and could be running YARN containers), since in client mode your driver will potentially use a lot of memory and CPU.

- Finally, in cluster mode YARN can restart your application without user interference. This is useful for things that need to stay up (think of a long-running streaming job, for example).*

*If your client is not close to the cluster (e.g. your PC) then you definitely want to go cluster mode to improve performance. If your client is close to the cluster (e.g. an edge node) then you could go either client or cluster. Note that by going client, more resources are going to be used on the edge node.*

In this part one, we will confine ourselves to *Spark on the local host* and will leave the other two to later parts.

*Spark Local Mode*

Spark Local Mode is the simplest configuration of Spark and does not require a cluster. The user on the local host can launch and experiment with Spark.

In this mode the driver program (SparkSubmit), the resource manager and the executor all exist within the same JVM. The JVM itself is the worker thread.

When you use spark-shell, or for that matter spark-sql, you are starting spark-submit under the bonnet. These two shells were created to make life easier when working with Spark. However, if you look at what $SPARK_HOME/bin/spark-shell does in the script, you will notice my point:

"${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"

So that is basically the spark-submit JVM invoked with the name "Spark shell". Since it is using spark-submit, it takes all the parameters related to spark-submit. However, remember that these two shells are created for a read-eval-print loop (REPL) and they default to Local mode. You cannot use them, for example, in YARN cluster mode.

Some default parameters can be changed. For example, the default web GUI port for Spark is 4040. Here I start it on port 55555 and also give it a different name:

"${SPARK_HOME}"/bin/spark-submit --conf "spark.ui.port=55555" --class org.apache.spark.repl.Main --name "my own Spark shell" "$@"

Before going further, let us understand the concept of cores and threads. These days we talk about cores more than CPUs, and each CPU comes with a number of cores. Simply put, to work out the number of threads you can do this:

cat /proc/cpuinfo|grep processor|wc -l

which for me returns 12, and that is all I need to know, without worrying about what physical cores, logical cores and CPUs really mean, as these definitions may vary from one hardware vendor to another.

In local mode you have:

--master local

This will start with one (worker) *thread*, which is equivalent to --master local[1]. You can start with more than one thread by specifying the number of threads *k* in --master local[k]. You can also start using all available threads with --master local[*]. The degree of parallelism is defined by the number of threads *k*.

In *Local mode*, you do not need to start master and slaves/workers. In this mode it is pretty simple and you can run as many JVMs (spark-submit) as your resources allow (resources meaning memory and cores); a quick way to check the thread count you asked for is sketched below.
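A minimal sketch only (not from the original guide; the object name is made up), confirming from inside the application which master setting is in effect and how many threads Spark will use by default:

import org.apache.spark.{SparkConf, SparkContext}

object ThreadCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ThreadCheck").setMaster("local[4]"))

    println(sc.master)              // prints local[4]
    // For local[k] this normally reports k, unless spark.default.parallelism overrides it
    println(sc.defaultParallelism)

    sc.stop()
  }
}

You can run the same two println lines directly in spark-shell, where sc is already created for you.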
Additionally, the GUI starts by default on port 4040, the next one on 4041 and so forth, unless you specifically start it with --conf "spark.ui.port=nnnnn".

Remember, this is all about testing your apps. It is NOT a performance test. What it allows you to do is test multiple apps concurrently and, more importantly, get started and understand the various configuration parameters that Spark uses together with the spark-submit executable.

You can of course use the spark-shell and spark-sql utilities. These in turn rely on the spark-submit executable to run certain variations of the JVM. In other words, you are still executing spark-submit. You can pass parameters to spark-submit, with an example shown below:

${SPARK_HOME}/bin/spark-submit \
                --packages com.databricks:spark-csv_2.11:1.3.0 \
                --driver-memory 2G \
                --num-executors 1 \
                --executor-memory 2G \
                --master local \
                --executor-cores 2 \
                --conf "spark.scheduler.mode=FAIR" \
                --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
                --jars /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
                --class "${FILE_NAME}" \
                --conf "spark.ui.port=4040" \
                --conf "spark.driver.port=54631" \
                --conf "spark.fileserver.port=54731" \
                --conf "spark.blockManager.port=54832" \
                --conf "spark.kryoserializer.buffer.max=512" \
                ${JAR_FILE} \
                >> ${LOG_FILE}

*Note that in the above example I am only using modest resources. This is intentional, to ensure that resources are available for the other Spark jobs that I may be testing on this standalone node.*

*Alternatively, you can specify some of these parameters when you are creating a new SparkConf:*

val sparkConf = new SparkConf().
        setAppName("CEP_streaming").
        setMaster("local").
        set("spark.executor.instances", "1").
        set("spark.executor.memory", "2G").
        set("spark.executor.cores", "2").
        set("spark.cores.max", "2").
        set("spark.driver.allowMultipleContexts", "true").
        set("spark.hadoop.validateOutputSpecs", "false")

*You can practically run most of your unit testing with Local mode and exercise a variety of options, including running SQL queries, reading data from CSV files, writing to HDFS, creating Hive tables (including ORC tables) and doing Spark Streaming.*

*The components of a Spark App*

*Although this may be of little relevance to Local mode, it would be beneficial to clarify a number of Spark terminologies here.*

A Spark application consists of a driver program and a list of executors. The driver program is the main program, which coordinates the executors to run the Spark application. Executors are worker nodes' processes in charge of running individual tasks in a given Spark job; they run the tasks assigned to them by the driver program. In Local mode, the driver program runs inside the JVM on the local machine. There is only one executor, it is called the *driver*, and the tasks are executed by threads locally as well. This single executor will be started with *k* threads. Local mode is different from Standalone mode, which uses Spark's in-built cluster set-up.

*Driver Program:* The driver is the process started by spark-submit. The application relies on the initial Spark-specific environment settings in the shell where the application is started to create what is known as the *SparkContext object*. SparkContext tells the driver program how to access the Spark cluster, among other things. It is a separate Java process.
It is identified as *SparkSubmit* in jps.

*Standalone Master* is not required in Local mode.

*Standalone Worker* is not required in Local mode.

*Executor* is the program that is launched on the Worker when a Job starts executing.

*Tasks:* Each Spark application is broken down into stages, and each stage is completed by one or more tasks. A task is a thread of execution that an executor runs on a single node.

*Cache* is the memory allocated to this Spark process.

Going back to Figure 1 (not reproduced here), we notice the Cache, Executor and Tasks; these are as described above.

Figure 2 below shows a typical Spark master URL. Note the number of cores and the memory allocation for each worker. These are the default maximums on this host. Again, these are resource ceilings; it does not mean that the workers go and grab those values.

*Figure 2: A typical Spark master URL*

Note that, as stated, each worker grabs all the available cores and allocates the remaining memory on each host. However, these values are somewhat misleading and are not updated, so I would not worry too much about what it says on that page.

*Configuring Spark parameters*

To configure Spark shell parameters, you will need to modify the settings in the $SPARK_HOME/conf/spark-env.sh script. Note that the shells in $SPARK_HOME/sbin call the $SPARK_HOME/conf/spark-env.sh script, so if you modify this file, remember to restart your master and slave routines.

Every Spark executor in an application has the same fixed number of cores and the same fixed heap size. The number of cores can be specified with the --executor-cores flag when invoking spark-submit, spark-shell and pyspark from the command line, or by setting the spark.executor.cores property in the spark-defaults.conf file or on a SparkConf object. Similarly, the heap size can be controlled with the --executor-memory flag or the spark.executor.memory property. The cores property controls the number of concurrent tasks an executor can run: *--executor-cores 5 means that each executor can run a maximum of five tasks at the same time.* The memory property impacts the amount of data Spark can cache, as well as the maximum sizes of the shuffle data structures used for grouping, aggregations and joins.

*The --num-executors command-line flag or the spark.executor.instances configuration property controls the number of executors requested.* Starting in CDH 5.4/Spark 1.3, you can avoid setting this property by turning on dynamic allocation with the spark.dynamicAllocation.enabled property. Dynamic allocation enables a Spark application to request executors when there is a backlog of pending tasks and to free up executors when they are idle.

*Resource scheduling*

The Standalone cluster mode currently only supports a simple First In First Out (FIFO) scheduler across applications. Thus, to allow multiple concurrent users, you can control the maximum amount of resources each application will use. By default, the memory used is 512M. You can increase this value by setting the following parameter in $SPARK_HOME/conf/spark-defaults.conf:

spark.driver.memory        4g

or by supplying the configuration setting at runtime to spark-shell or spark-submit.

*Note that in this mode a process will acquire all cores in the cluster by default, which only makes sense if you run just one application at a time.* You can cap the number of cores by setting spark.cores.max in your SparkConf. For example:

  val conf = new SparkConf().
               setAppName("MyApplication").
               setMaster("local[2]").
               set("spark.executor.memory", "4G").
               set("spark.cores.max", "2").
set("spark.driver.allowMultipleContexts", "true") val sc =3D new SparkContext(conf) Note that setMaster("local[2]"). Specifies that it is run locally with two threads - local uses 1 thread. - local[N] uses N threads. - local[*] uses as many threads as there are cores. However, since driver-memory setting encapsulates the JVM, you will need to set the amount of driver memory for any non-default value *before starting JVM by providing the new value:* ${SPARK_HOME}/bin/spark-shell --driver-memory 4g Or ${SPARK_HOME}/bin/spark-submit --driver-memory 4g You can of course have a simple SparkConf values and set the additional Spark configuration parameters at submit time Example val sparkConf =3D new SparkConf(). setAppName("CEP_streaming"). * setMaster("local[2]").* set("spark.streaming.concurrentJobs", "2"). set("spark.driver.allowMultipleContexts", "true"). set("spark.hadoop.validateOutputSpecs", "false") And at submit time do ${SPARK_HOME}/bin/spark-submit \ --master local[2] \ --driver-memory 4G \ --num-executors 1 \ --executor-memory 4G \ --executor-cores 2 \ =E2=80=A6.. Note that this will override earlier Spark configuration parameters with sparkConf *Resource Monitoring* You can see the job progress in Spark Job GUI that by default runs on :4040. This GUI has different tabs for Jobs, Stages, Executors etc. An example is shown below: *Figure 3: A typical Spark Job URL* Figure 3 shows the status of Jobs. This is a simple job that uses JDBC to access Oracle database and a table called dummy with 1 billion rows. It then takes that table, caches it by registering it as temptable, create an ORC table in Hive and populates that table. It was compiled using Maven and executed through $SPARK_HOME/sbin/spark-submit.sh The code is shown below: for ETL_scratchpad_dummy.scala import org.apache.spark.SparkContext import org.apache.spark.SparkConf import org.apache.spark.sql.Row import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.types._ import org.apache.spark.sql.SQLContext import org.apache.spark.sql.functions._ object ETL_scratchpad_dummy { def main(args: Array[String]) { val conf =3D new SparkConf(). setAppName("ETL_scratchpad_dummy"). setMaster("local[2]"). set("spark.executor.memory", "4G"). set("spark.cores.max", "2"). 
set("spark.driver.allowMultipleContexts", "true") val sc =3D new SparkContext(conf) // Create sqlContext based on HiveContext val sqlContext =3D new HiveContext(sc) import sqlContext.implicits._ val HiveContext =3D new org.apache.spark.sql.hive.HiveContext(sc) println ("\nStarted at"); sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println) HiveContext.sql("use oraclehadoop") var _ORACLEserver : String =3D "jdbc:oracle:thin:@rhes564:1521:mydb12" var _username : String =3D "scratchpad" var _password : String =3D "xxxxxx" // Get data from Oracle table scratchpad.dummy val d =3D HiveContext.load("jdbc", Map("url" -> _ORACLEserver, "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)", "user" -> _username, "password" -> _password)) d.registerTempTable("tmp") // // Need to create and populate target ORC table oraclehadoop.dummy // HiveContext.sql("use oraclehadoop") // // Drop and create table dummy // HiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy") var sqltext : String =3D "" sqltext =3D """ CREATE TABLE oraclehadoop.dummy ( ID INT , CLUSTERED INT , SCATTERED INT , RANDOMISED INT , RANDOM_STRING VARCHAR(50) , SMALL_VC VARCHAR(10) , PADDING VARCHAR(10) ) CLUSTERED BY (ID) INTO 256 BUCKETS STORED AS ORC TBLPROPERTIES ( "orc.create.index"=3D"true", "orc.bloom.filter.columns"=3D"ID", "orc.bloom.filter.fpp"=3D"0.05", "orc.compress"=3D"SNAPPY", "orc.stripe.size"=3D"16777216", "orc.row.index.stride"=3D"10000" ) """ HiveContext.sql(sqltext) // // Put data in Hive table. Clean up is already done // sqltext =3D """ INSERT INTO TABLE oraclehadoop.dummy SELECT ID , CLUSTERED , SCATTERED , RANDOMISED , RANDOM_STRING , SMALL_VC , PADDING FROM tmp """ HiveContext.sql(sqltext) println ("\nFinished at"); sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println) sys.exit() } } If you look at Figure 3 you will the status of the job broken into *Active Jobs* and *Completed Jobs *respectively. The description is pretty smart. It tells you which line of code was executed. For example =E2=80=9Ccollect = at ETL_scratchpad_dummy.scala:24=E2=80=9D refers to line 24 of the code which = is below: println ("\nStarted at"); sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println) This Job (Job Id 0) is already completed On the other hand Active Job Id 1 =E2=80=9Csql at ETL_scratchpad_dummy.scal= a:87=E2=80=9D is currently running at line 87 of the code which is sqltext =3D """ INSERT INTO TABLE oraclehadoop.dummy SELECT ID , CLUSTERED , SCATTERED , RANDOMISED , RANDOM_STRING , SMALL_VC , PADDING FROM tmp """ HiveContext.sql(sqltext) We can look at this job further by looking at the active job session in GUI though stages *Figure 4: Drilling down to execution* ................... HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2gBxianrbJd6= zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
On 21 July 2016 at 12:27, Joaquin Alzola wrote:

> Do you have the same as link 1 but in English?
>
> - spark-questions-concepts
>
> - deep-into-spark-exection-model
>
> Seems a really interesting post but it is in Chinese. I suppose Google Translate
> does a poor job on the translation.
>
>
> *From:* Taotao.Li [mailto:charles.upboy@gmail.com]
> *Sent:* 21 July 2016 04:04
> *To:* Jean Georges Perrin
> *Cc:* Sachin Mittal; user
> *Subject:* Re: Understanding spark concepts cluster, master, slave, job,
> stage, worker, executor, task
>
> Hi, Sachin, here are two posts about the basic concepts of Spark:
>
> - spark-questions-concepts
>
> - deep-into-spark-exection-model
>
> And I fully recommend Databricks' post:
> https://databricks.com/blog/2016/06/22/apache-spark-key-terms-explained.html
>
>
> On Thu, Jul 21, 2016 at 1:36 AM, Jean Georges Perrin wrote:
>
> Hey,
>
> I love when questions are numbered, it's easier :)
>
> 1) Yes (but I am not an expert)
>
> 2) You don't control... One of my processes is going to 8k tasks, so...
>
> 3) Yes, if you have HT, it doubles. My servers have 12 cores, but HT, so it
> makes 24.
>
> 4) From my understanding: Slave is the logical computational unit and
> Worker is really the one doing the job.
>
> 5) Dunnoh
>
> 6) Dunnoh
>
> On Jul 20, 2016, at 1:30 PM, Sachin Mittal wrote:
>
> Hi,
>
> I was able to build and run my spark application via spark-submit.
>
> I have understood some of the concepts by going through the resources at
> https://spark.apache.org but a few doubts still remain. I have a few specific
> questions and would be glad if someone could shed some light on them.
>
> So I submitted the application using spark.master local[*] and I have an
> 8 core PC.
>
> - What I understand is that the application is called a job. Since mine had
> two stages, it gets divided into 2 stages and each stage had a number of tasks
> which ran in parallel. Is this understanding correct?
>
> - What I notice is that each stage is further divided into 262 tasks. Where
> did this number 262 come from? Is this configurable? Would increasing
> this number improve performance?
>
> - Also I see that the tasks are run in parallel in sets of 8. Is this
> because I have an 8 core PC?
>
> - What is the difference or relation between slave and worker? When I did
> spark-submit, did it start 8 slaves or worker threads?
>
> - I see all worker threads running in one single JVM. Is this because I
> did not start slaves separately and connect them to a single master cluster
> manager? If I had done that, would each worker have run in its own JVM?
>
> - What is the relationship between worker and executor? Can a worker have
> more than one executor? If yes, then how do we configure that? Do all
> executors run in the worker JVM as independent threads?
>
> I suppose that is all for now. Would appreciate any response. Will add
> follow-up questions if any.
>
> Thanks
>
> Sachin
>
>
> --
>
> *___________________*
>
> Quant | Engineer | Boy
>
> *___________________*
>
> *blog*: http://litaotao.github.io
>
> *github*: www.github.com/litaotao
>
> This email is confidential and may be subject to privilege. If you are not
> the intended recipient, please do not copy or disclose its content but
> contact the sender immediately upon receipt.