spark-issues mailing list archives

From "Deepansh (JIRA)" <>
Subject [jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)
Date Tue, 13 Mar 2018 11:15:00 GMT


Deepansh commented on SPARK-23650:

I tried this both in local mode and on a YARN cluster; the results are more or less the same.

Because of this, I went through the Spark code. As I understand it, every time new data
arrives from the Kafka stream, Spark creates a new RRunner object and ships the broadcast
variables to it. Shouldn't that happen only once, rather than every time new data arrives?

> Slow SparkR udf (dapply)
> ------------------------
>                 Key: SPARK-23650
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Shell, SparkR, Structured Streaming
>    Affects Versions: 2.2.0
>            Reporter: Deepansh
>            Priority: Major
> For example, I am reading streams from Kafka and want to apply a model built in R to
> those streams. For this, I am using dapply.
> My code is:
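> # Load the trained model and broadcast it to the workers
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> # Read the Kafka topic as a streaming SparkDataFrame
> kafka <- read.stream("kafka", subscribe = "source", kafka.bootstrap.servers = "localhost:9092",
>                      topic = "source")
> lines <- select(kafka, cast(kafka$value, "string"))
> schema <- schema(lines)
> df1 <- dapply(lines, function(x) {
>   # Fetch the broadcast model on the worker
>   i_model <- SparkR:::value(randomBr)
>   # Score each JSON record and write the result back as JSON
>   for (row in seq_len(nrow(x))) {
>     y <- fromJSON(as.character(x[row, "value"]))
>     y$predict <- predict(i_model, y)
>     x[row, "value"] <- toJSON(y)
>   }
>   x
> }, schema)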
> Every time Kafka streams are fetched, the dapply method creates a new runner thread
> and ships the variables again, which causes a large lag (~2 s to ship the model) each time.
> I even tried without broadcast variables, but shipping the variables takes the same time.
> Are there other techniques that could improve its performance?

This message was sent by Atlassian JIRA
