spark-dev mailing list archives

From Yang Cao
Subject increase efficiency of working with mongo and mysql database
Date Mon, 26 Sep 2016 15:22:31 GMT
Dear all,

I am currently working with Spark 1.6.2, MongoDB and MySQL, and I am stuck on a performance
problem. The scenario is: read data from MongoDB into Spark, do some counting work that
produces a result of several rows, and write that result to a MySQL database. In pseudocode:

val offset = …
val mongoDF = getMongoDF by Stratio package(0.11.0).filter(based on offset)

val resDF = doing counting job based on mongoDF

resDF.write().jdbc(info of connection)

The logic is quite simple. But after several tests, I found that loading from MongoDB
and saving to MySQL have become the bottleneck of my application.

The job that reads data from MongoDB is always split into two stages. The first one is
flatMap at MongodbSchema.scala:41 and the second one is aggregate at MongodbSchema.scala:47.
In my situation, it looks like this:

[screenshot not preserved in the archive]

It shows that the first stage gets only one task on one executor, which is extremely
slow when working with a collection of billions of rows. Sometimes the first stage takes
1 hour while the second takes only several seconds.
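
The stage names point at MongodbSchema.scala, which suggests (as an assumption, not
something confirmed in this message) that the slow single-task stage is the connector
scanning the collection to infer a schema. One way to test that hypothesis is to pass an
explicit schema to the reader so that inference may be skipped. The sketch below is only
that: the data source name and option keys are written from memory of the Stratio
spark-mongodb README and should be checked against version 0.11.0, and the host, database,
collection and field names are hypothetical placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object ExplicitSchemaRead {
  def main(args: Array[String]): Unit = {
    // Master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("explicit-schema-read"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical schema: replace with the fields the counting job actually needs.
    val schema = StructType(Seq(
      StructField("offset", LongType, nullable = true),
      StructField("category", StringType, nullable = true)))

    // Assumption: format name and option keys as recalled from the connector's README.
    val mongoDF = sqlContext.read
      .format("com.stratio.datasource.mongodb")
      .options(Map(
        "host" -> "localhost:27017",   // hypothetical host
        "database" -> "mydb",          // hypothetical database
        "collection" -> "events"))     // hypothetical collection
      .schema(schema)                  // standard DataFrameReader call: user-supplied schema
      .load()

    mongoDF.printSchema()
    sc.stop()
  }
}
```

If the first stage disappears or shrinks with an explicit schema, inference was the cost;
if not, how the read itself is partitioned would be the next thing to inspect.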

On the JDBC side it is similar: the saving process also runs in two stages, one with a
single task and the other with 200 tasks, at DataFrameWriter.scala:311.

So my application always gets stuck in the stage with only one task. My cluster has free
resources and my MongoDB server is also mostly idle. Can someone explain why these stages
get only one executor? Are there any suggestions for speeding up these stages?

I have set spark.default.parallelism to 400 in the configuration, but it does not seem to help.
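
For context, spark.default.parallelism applies to RDD operations that do not specify a
partition count, while DataFrame shuffles (such as aggregations) use
spark.sql.shuffle.partitions, whose default is 200. That may be where the 200-task stage
comes from. A minimal sketch of setting both, assuming a standalone application; the
application name is hypothetical, and neither setting is expected to change how many
partitions the initial read from the data source produces:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParallelismSettings {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("parallelism-settings")        // hypothetical application name
      .set("spark.default.parallelism", "400")   // default partition count for RDD shuffles
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // DataFrame shuffles read this setting instead; the default is 200.
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")

    println(sqlContext.getConf("spark.sql.shuffle.partitions"))
    sc.stop()
  }
}
```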

Any suggestions would be appreciated. Thanks.


Matthew Cao