crunch-user mailing list archives

From: Micah Whitacre <mkwhita...@gmail.com>
Subject: Re: Spark Scheduler
Date: Wed, 30 Sep 2015 17:54:15 GMT
Try switching your test around a bit because I believe there are instances
even with MRPipeline where Crunch will kick off multiple jobs in parallel.

Something like the following:

Read Input1 -> Filter -> Write Output1
Read Input2 -> Filter -> Write Output2
pipeline.done();
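
Fleshed out, that might look roughly like the following (a minimal
sketch; the paths and the filter are illustrative placeholders, not
from your actual test):

import org.apache.crunch.FilterFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;

public class ParallelJobsTest {
  // Trivial example filter: keep non-empty lines.
  static class NonEmptyFn extends FilterFn<String> {
    @Override
    public boolean accept(String input) {
      return !input.isEmpty();
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(ParallelJobsTest.class);

    // Two fully independent read -> filter -> write branches.
    PCollection<String> in1 = pipeline.read(From.textFile("input1"));
    in1.filter(new NonEmptyFn()).write(To.textFile("output1"));

    PCollection<String> in2 = pipeline.read(From.textFile("input2"));
    in2.filter(new NonEmptyFn()).write(To.textFile("output2"));

    // Plan and execute; the planner sees two disconnected DAGs,
    // so it may schedule the resulting jobs in parallel.
    pipeline.done();
  }
}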

Try it with the MRPipeline and then with Spark to see what runs in
parallel vs. what runs serially.

The other option, which is less ideal, is to change your code to:

Read Input1 -> Filter -> Write Output1
pipeline.runAsync()
Read Input2 -> Filter -> Write Output2
pipeline.runAsync()
pipeline.done();
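
In code, building on the sketch above (the PipelineExecution handles
let you wait on or inspect each submitted run; the paths are still
illustrative):

// Continuation of the earlier sketch; also needs
// import org.apache.crunch.PipelineExecution;
pipeline.read(From.textFile("input1"))
        .filter(new NonEmptyFn())
        .write(To.textFile("output1"));
// Submit everything planned so far without blocking.
PipelineExecution exec1 = pipeline.runAsync();

pipeline.read(From.textFile("input2"))
        .filter(new NonEmptyFn())
        .write(To.textFile("output2"));
// Submit the second branch while the first may still be running.
PipelineExecution exec2 = pipeline.runAsync();

// Blocks until all outstanding work finishes and cleans up.
pipeline.done();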

This should kick each of them off independently and give you the
parallelism. It would be nice, however, if you didn't have to do this
splitting yourself and it was done for you.
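
For context, the scheduling-within-an-application behavior quoted from
the Spark docs below is easy to see with plain Spark, no Crunch
involved. A minimal sketch (the data and thread handling are
illustrative):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ThreadedActions {
  public static void main(String[] args) throws InterruptedException {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("threaded-actions"));

    // Each thread triggers its own action (count); Spark's scheduler
    // is thread-safe, so the two jobs can run concurrently.
    Thread t1 = new Thread(() ->
        System.out.println(sc.parallelize(Arrays.asList(1, 2, 3)).count()));
    Thread t2 = new Thread(() ->
        System.out.println(sc.parallelize(Arrays.asList(4, 5, 6)).count()));
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    sc.stop();
  }
}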


On Wed, Sep 30, 2015 at 12:41 PM, Nithin Asokan <anithin19@gmail.com> wrote:

> I was reading about the Spark scheduler[1], and this line caught my attention:
>
> *Inside a given Spark application (SparkContext instance), multiple
> parallel jobs can run simultaneously if they were submitted from separate
> threads. By “job”, in this section, we mean a Spark action
> (e.g. save, collect) and any tasks that need to run to evaluate that
> action. Spark’s scheduler is fully thread-safe and supports this use case
> to enable applications that serve multiple requests (e.g. queries for
> multiple users).*
>
> If I understood the above statement correctly, it is possible to have
> multiple jobs running in parallel on a Spark application, as long as
> the *actions* are triggered by separate threads.
>
> I was trying to test this on my Crunch Spark application (yarn-client),
> which reads two independent HDFS sources and performs
> *PCollection#getLength()* on each source. The Spark WebUI shows Job1
> being submitted; only after Job1 completes is Job2 submitted and
> finished. I would like to get some thoughts on whether it is possible
> for Crunch to identify independent sources/targets and possibly create
> separate threads that can interact with the Spark scheduler. That way,
> I think we can have some independent jobs running in parallel.
>
> Here is the example that I used
> https://gist.github.com/nasokan/7a0820411656f618f182
>
> [1]
> https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
>
>
