ignite-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From roshan joe <impdocs2...@gmail.com>
Subject spark lookup against shared RDD
Date Tue, 21 Nov 2017 12:45:17 GMT
Below is the use-case I am trying to solve using Spark shared Rdd:

I have a json dataset which is periodically being saved to a shared RDD in
Streaming App-1 using "saveValues" as below.

val sharedRdd: IgniteRDD[String, String] = igniteContext.fromCache[String,
String](cachecfg)
sharedRdd.saveValues(jsonRdd.values)


This SharedRdd is used to lookup values against the incremental dataset in
Streaming App-2. Lookup is performed based on two fields (key and date) in
the json and both the incremental dataset and the sharedRdd contain these
fields. Below is how the values are retrieved from sharedRDD and the lookup
is performed currently.

    //Retrieve values from sharedRdd
    val sharedRdd: IgniteRDD[String, String] =
igniteContext.fromCache("sharedRdd")
    val sharedRddJson = sharedRdd.values

    //convert sharedRdd to DF
    val sharedJsonDF = spark.read.json(sharedRddJson)
    sharedJsonDF.createOrReplaceTempView("sharedJsonDF")

     //convert incremental dataset to DF
    incrementalDF.createOrReplaceTempView("incrementalDF")

    //perform the lookup using Join query
    val sqlQuery = "SELECT * " +
"FROM incrementalDF a " +
"INNER JOIN sharedJsonDF b " +
"ON a.key = b.key " +
"AND a.date <= b.date "

    val sqlDF = spark.sql(sqlQuery)


Below are the questions I have:

   - Would adding the Index on the Join fields on the sharedRdd help to
   improve the performance? If so, what is the best way to add index on the
   json data? I see some index being added in config file and some in code but
   couldn't find a working example.


   - Can the sharedRdd be directly joined against the incrementalDF using
   the "sql"? I couldn't get the "sql" working directly on the sharedRdd using
   "_key", "_val" fields. Do I need to add fields info in the config file for
   the sql to get working?


   - I believe the above join query currently causes a "Shuffle" when the 2
   DFs are joined. Is there a way to colocate sharedRdd and IncrementalRdd on
   the given 2 keys, so that the data movement can be avoided and performance
   improved?


   - Is there any significance of the "number of partitions" in the current
   case?


   - Is there anything else that can be done to make the above join /
   lookup faster?

 Thanks in advance for the time!

Mime
View raw message