spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "vijay.bvp" <bvpsa...@gmail.com>
Subject Re: Can spark handle this scenario?
Date Tue, 20 Feb 2018 08:47:07 GMT
I am assuming pullSymbolFromYahoo functions opens a connection to yahoo API
with some token passed, in the code provided so far if you have 2000
symbols, it will make 2000 new connections!! and 2000 API calls
connection objects can't/shouldn't be serialized and send to executors, they
should rather be created at executors. 

the philosophy given below is nicely documented on Spark Streaming, look at
Design Patterns for using foreachRDD
https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams


case class Symbol(symbol: String, sector: String)
case class Tick(symbol: String, sector: String, open: Double, close: Double)
//assume symbolDs is rdd of symbol and tick dataset/dataframe can be
converted to RDD
symbolRdd.foreachPartition(partition => {
       //this code runs at executor    
      //open a connection here - 
      val connectionToYahoo = new HTTPConnection()

      partition.foreach(k => {
          pullSymbolFromYahoo(k.symbol, k.sector,connectionToYahoo)
      }
}

with the above code if the dataset has 10 partitions (2000 symbols), only 10
connections will be opened though it makes 2000 API calls. 
you should also be looking at sending and receiving results for large number
of symbols, because of the amount of parallelism that spark provides you
might run into rate limit on the APIs. if you are bulk sending symbols above
pattern also very much useful

thanks
Vijay







--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Mime
View raw message