spark-dev mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Spark performance on 32 Cpus Server Cluster
Date Fri, 20 Feb 2015 12:31:26 GMT
Yes, that makes sense, but it doesn't make the jobs CPU-bound. What is
the bottleneck: the model building or other stages? I would think you
can get the model building to be CPU-bound, unless you have chopped it
up into really small partitions. I think it's best to look further
into which stages are slow, and what they seem to be spending time on --
GC? I/O?
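
One rough way to start on that is to compare the summed task time of a
slow stage with the core-seconds the cluster had available while it ran.
The following is a minimal Python sketch of that arithmetic, not Spark
API: the function names and the per-task durations are hypothetical, and
only the 256 cores and the 52 s stage time come from the quoted message
below. It assumes per-task duration and GC time are read off the
monitoring UI by hand.

```python
def slot_utilization(task_seconds, total_cores, stage_wall_seconds):
    """Fraction of available core-seconds that tasks actually occupied."""
    if total_cores <= 0 or stage_wall_seconds <= 0:
        raise ValueError("total_cores and stage_wall_seconds must be positive")
    return sum(task_seconds) / (total_cores * stage_wall_seconds)


def gc_fraction(task_seconds, gc_seconds):
    """Share of summed task time that was reported as garbage collection."""
    total = sum(task_seconds)
    if total <= 0:
        raise ValueError("no task time recorded")
    return sum(gc_seconds) / total


# Hypothetical stage: 256 tasks of 13 s each, with 2 s of GC per task,
# on 256 cores, where the stage took 52 s of wall-clock time.
tasks = [13.0] * 256
gc = [2.0] * 256
utilization = slot_utilization(tasks, total_cores=256, stage_wall_seconds=52.0)
gc_share = gc_fraction(tasks, gc)
print(f"slot utilization: {utilization:.2f}, GC share: {gc_share:.2f}")
```

A slot utilization well below 1 means cores sat without a task for much
of the stage, which points at too few or uneven tasks or at serial work
between stages. Note that an occupied slot is not the same as a busy
CPU: a task waiting on I/O or GC still holds its slot.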

On Fri, Feb 20, 2015 at 12:18 PM, Dirceu Semighini Filho
<dirceu.semighini@gmail.com> wrote:
> Hi Sean,
> I'm trying to increase the CPU usage by running logistic regression on
> different datasets in parallel. They shouldn't depend on each other.
> I train several logistic regression models from different column
> combinations of a main dataset. I processed the combinations in a ParArray
> in an attempt to increase CPU usage, but it did not help.
>
>
>
> 2015-02-20 8:17 GMT-02:00 Sean Owen <sowen@cloudera.com>:
>
>> It sounds like your computation just isn't CPU-bound, right? Or maybe
>> only some stages are. It's not clear what work you are doing
>> beyond the core LR.
>>
>> Stages don't wait on each other unless one depends on the other. You'd
>> have to clarify what you mean by running stages in parallel, like what
>> are the interdependencies.
>>
>> On Fri, Feb 20, 2015 at 10:01 AM, Dirceu Semighini Filho
>> <dirceu.semighini@gmail.com> wrote:
>> > Hi all,
>> > I'm running Spark 1.2.0 in standalone mode, on different cluster and
>> > server sizes. All of my data is cached in memory.
>> > Basically I have about 8 GB of data with about 37k columns, and I'm
>> > running different configs of a BinaryLogisticRegressionBFGS.
>> > When I ran Spark on 9 servers (1 master and 8 slaves) with 32 cores
>> > each, I noticed that the CPU usage varied from 20% to 50% (counting
>> > the CPU usage of all 9 servers in the cluster).
>> > First I tried to repartition the RDDs to the total number of client
>> > cores (256), but that didn't help. Then I tried setting the property
>> > *spark.default.parallelism* to the same number (256), but that didn't
>> > increase the CPU usage either.
>> > Looking at the Spark monitoring tool, I saw that some stages took 52s
>> > to complete.
>> > My last shot was to run some tasks in parallel, but when I ran tasks
>> > in parallel (4 tasks), the total CPU time spent to complete them
>> > increased by about 10%; task parallelism didn't help.
>> > Looking at the monitoring tool, I noticed that when running tasks in
>> > parallel, the stages complete together: if I have 4 stages running in
>> > parallel (A, B, C and D) and A, B and C finish, they wait for D
>> > before all 4 stages are marked as completed. Is that right?
>> > Is there any way to improve the CPU usage when running on large servers?
>> > Is spending more time when running tasks expected behaviour?
>> >
>> > Kind Regards,
>> > Dirceu
>
>
