crunch-user mailing list archives

From Nithin Asokan
Subject Spark Scheduler
Date Wed, 30 Sep 2015 17:41:52 GMT
I was reading about the Spark scheduler [1], and this passage caught my attention:

*Inside a given Spark application (SparkContext instance), multiple
parallel jobs can run simultaneously if they were submitted from separate
threads. By “job”, in this section, we mean a Spark action
(e.g. save, collect) and any tasks that need to run to evaluate that
action. Spark’s scheduler is fully thread-safe and supports this use case
to enable applications that serve multiple requests (e.g. queries for
multiple users).*

If I understand the above statement correctly, it is possible to have
multiple jobs running in parallel in a single Spark application, as long as the
*actions* are triggered from separate threads.
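
To make the threading pattern concrete, here is a minimal plain-Java sketch of what I mean by triggering actions from separate threads. It makes no Spark or Crunch calls: the class name ParallelActions and the helper countRecords are hypothetical, and countRecords is only a stand-in for a blocking action on one source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelActions {

    // Hypothetical stand-in for a blocking action on one source.
    // In a real application this is where the action (a count, a save, ...)
    // would be triggered; here it only returns the list size.
    static long countRecords(List<String> source) {
        return source.size();
    }

    // Submits one action per source, each from its own pool thread,
    // and returns the results in the same order as the inputs.
    static List<Long> runInParallel(List<List<String>> sources) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, sources.size()));
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (List<String> source : sources) {
                Callable<Long> action = () -> countRecords(source);
                futures.add(pool.submit(action));
            }
            List<Long> counts = new ArrayList<>();
            for (Future<Long> future : futures) {
                counts.add(future.get()); // blocks until that action has finished
            }
            return counts;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> sources = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("x", "y"));
        System.out.println(runInParallel(sources));
    }
}
```

The point is only that each action is submitted from its own thread, so neither has to wait for the other to finish before it is submitted.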

I tried to test this with my Crunch Spark application (yarn-client),
which reads two independent HDFS sources and calls *PCollection#length()*
on each source. The Spark WebUI shows Job1 being submitted first; only after
Job1 completes is Job2 submitted and run. I would like to get some
thoughts on whether it is possible for Crunch to identify independent
sources/targets and create separate threads that interact with the
Spark scheduler. That way, independent jobs could run in parallel.

Here is the example that I used

