spark-user mailing list archives

From Frank Austin Nothaft <fnoth...@berkeley.edu>
Subject Re: Worker and Nodes
Date Sat, 21 Feb 2015 15:21:55 GMT
There could be many different things causing this. For example, if you only have a single partition
of data, increasing the number of tasks will only increase execution time due to higher scheduling
overhead. Additionally, how large is a single partition in your application relative to the
amount of memory on the machine? If you are running on a machine with a small amount of memory,
increasing the number of executors per machine may increase GC/memory pressure. On a single
node, since your executors share a memory and I/O system, you could just thrash everything.
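
As a rough illustration of the single-partition case, here is a toy cost model in Python. It is
only a sketch and not how Spark's scheduler actually works: the function name, the 10-second
task time, and the 0.5-second per-slot overhead are all made-up assumptions.

```python
import math

def estimated_runtime(num_partitions, num_slots, seconds_per_partition, overhead_per_slot):
    """Toy model (hypothetical, not Spark's scheduler).

    Tasks run in waves across the slots that actually have a partition to
    process, and every configured slot adds a fixed amount of overhead.
    """
    usable_slots = min(num_partitions, num_slots)
    waves = math.ceil(num_partitions / usable_slots)
    return waves * seconds_per_partition + num_slots * overhead_per_slot

# One partition: extra slots cannot share the work, so only the overhead grows.
for slots in (1, 2, 4, 8, 16):
    print(slots, estimated_runtime(1, slots, 10.0, 0.5))
```

Under these assumptions the estimate rises with the slot count when there is one partition, and
falls only when there are at least as many partitions as slots.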

In any case, you can’t normally generalize from increased parallelism on a single node
to increased parallelism across a cluster. If you are purely limited by CPU, then yes, you
can normally make that generalization. However, when you increase the number of workers in
a cluster, you are providing your app with more resources (memory capacity and bandwidth,
and disk bandwidth). When you increase the number of tasks executing on a single node, you
do not increase the pool of available resources.
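
The arithmetic behind that difference can be sketched as follows. This assumes a hypothetical
16 GB machine and workers that split a node's memory evenly, which is a simplification; the
function names are illustrative and not part of any Spark API.

```python
def memory_per_worker_gb(node_memory_gb, workers_per_node):
    """Share of one node's RAM per worker, assuming an even split (an assumption)."""
    return node_memory_gb / workers_per_node

def cluster_memory_gb(node_memory_gb, num_nodes):
    """Total RAM when each additional worker is a separate machine."""
    return node_memory_gb * num_nodes

NODE_MEMORY_GB = 16.0  # hypothetical machine size

# Column 2: per-worker memory on a single node shrinks as workers are added.
# Column 3: total memory across a cluster grows as nodes are added.
for n in (1, 2, 4, 8, 16):
    print(n, memory_per_worker_gb(NODE_MEMORY_GB, n), cluster_memory_gb(NODE_MEMORY_GB, n))
```

With 16 workers on the one machine each worker is left with 1 GB, while 16 separate machines of
the same size would offer 256 GB in total.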

Frank Austin Nothaft
fnothaft@berkeley.edu
fnothaft@eecs.berkeley.edu
202-340-0466

On Feb 21, 2015, at 4:11 PM, Deep Pradhan <pradhandeep1991@gmail.com> wrote:

> No, I just have a single node standalone cluster.
> 
> I am not tweaking around with the code to increase parallelism. I am just running the
> SparkKMeans example that is there in Spark-1.0.0.
> I just wanted to know if this behavior is natural and, if so, what causes it.
> 
> Thank you
> 
> On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen <sowen@cloudera.com> wrote:
> What's your storage like? Are you adding worker machines that are
> remote from where the data lives? I wonder if it just means you are
> spending more and more time sending the data over the network as you
> try to ship more of it to more remote workers.
> 
> To answer your question: no. In general, more workers means more
> parallelism and therefore faster execution. But that depends on a lot
> of things. For example, if your process isn't parallelized to use all
> available execution slots, adding more slots doesn't do anything.
> 
> On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan <pradhandeep1991@gmail.com> wrote:
> > Yes, I am talking about standalone single node cluster.
> >
> > No, I am not increasing parallelism. I just wanted to know if it is natural.
> > Does message passing across the workers account for what is happening?
> >
> > I am running SparkKMeans, just to validate one prediction model. I am using
> > several data sets. I am running in standalone mode, varying the workers from 1
> > to 16.
> >
> > On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen <sowen@cloudera.com> wrote:
> >>
> >> I can imagine a few reasons. Adding workers might cause fewer tasks to
> >> execute locally (?), so you may be executing more remotely.
> >>
> >> Are you increasing parallelism? For trivial jobs, chopping them up
> >> further may cause you to pay more overhead for managing so many small
> >> tasks, with no speedup in execution time.
> >>
> >> Can you provide any more specifics though? You haven't said what
> >> you're running, what mode, how many workers, how long it takes, etc.
> >>
> >> On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan <pradhandeep1991@gmail.com> wrote:
> >> > Hi,
> >> > I have been running some jobs in my local single node stand alone cluster. I
> >> > am varying the worker instances for the same job, and the time taken for the
> >> > job to complete increases with increase in the number of workers. I repeated
> >> > some experiments varying the number of nodes in a cluster too and the same
> >> > behavior is seen.
> >> > Can the idea of worker instances be extrapolated to the nodes in a cluster?
> >> >
> >> > Thank You
> >
> >
> 

