spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hollin Wilkins <hol...@combust.ml>
Subject Re: Can't load a RandomForestClassificationModel in Spark job
Date Wed, 15 Feb 2017 07:47:56 GMT
Hey there,

Creating a new SparkContext on workers will not work, only the driver is
allowed to own a SparkContext. Are you trying to distribute your model to
workers so you can create a distributed scoring service? If so, it may be
worth looking into taking your models outside of a SparkContext and serving
them separately.

If this is your use case, take a look at MLeap. We use it in production to
serve high-volume realtime requests from Spark-trained models:
https://github.com/combust/mleap

Cheers,
Hollin

On Tue, Feb 14, 2017 at 4:46 PM, Jianhong Xia <jxia@infoblox.com> wrote:

> Is there any update on this problem?
>
>
>
> I encountered the same issue that was mentioned here.
>
>
>
> I have CrossValidatorModel.transform(df) running on workers, which
> requires DataFrame as an input. However, we only have Arrays on workers.
> When we deploy our model into cluster mode, we could not create
> createDataFrame on workers. It will give me error:
>
>
>
>
>
> 17/02/13 20:21:27 ERROR Detector$: Error while detecting threats
>
> java.lang.NullPointerException
>
>      at org.apache.spark.sql.SparkSession.sessionState$
> lzycompute(SparkSession.scala:111)
>
>      at org.apache.spark.sql.SparkSession.sessionState(
> SparkSession.scala:109)
>
>      at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
>
>      at org.apache.spark.sql.SparkSession.createDataFrame(
> SparkSession.scala:270)
>
>      at com.mycompany.analytics.models.app.serializable.
> AppModeler.detection(modeler.scala:370)
>
>
>
>
>
>
>
> On the other hand, if we run in the local, everything works fine.
>
>
>
> Just want to know, if there is any successful case that run machine
> learning models on the workers.
>
>
>
>
>
> Thanks,
>
> Jianhong
>
>
>
>
>
> *From:* Sumona Routh [mailto:sumosha2@gmail.com]
> *Sent:* Thursday, January 12, 2017 6:20 PM
> *To:* ayan guha <guha.ayan@gmail.com>; user@spark.apache.org
> *Subject:* Re: Can't load a RandomForestClassificationModel in Spark job
>
>
>
> Yes, I save it to S3 in a different process. It is actually the
> RandomForestClassificationModel.load method (passed an s3 path) where I
> run into problems.
> When you say you load it during map stages, do you mean that you are able
> to directly load a model from inside of a transformation? When I try this,
> it passes the function to a worker, and the load method itself appears to
> attempt to create a new SparkContext, which causes an NPE downstream
> (because creating a SparkContext on the worker is not an appropriate thing
> to do, according to various threads I've found).
>
> Maybe there is a different load function I should be using?
>
> Thanks!
>
> Sumona
>
>
>
> On Thu, Jan 12, 2017 at 6:26 PM ayan guha <guha.ayan@gmail.com> wrote:
>
> Hi
>
>
>
> Given training and predictions are two different applications, I typically
> save model objects to hdfs and load it back during prediction map stages.
>
>
>
> Best
>
> Ayan
>
>
>
> On Fri, 13 Jan 2017 at 5:39 am, Sumona Routh <sumosha2@gmail.com> wrote:
>
> Hi all,
>
> I've been working with Spark mllib 2.0.2 RandomForestClassificationModel.
>
> I encountered two frustrating issues and would really appreciate some
> advice:
>
> 1)  RandomForestClassificationModel is effectively not serializable (I
> assume it's referencing something that can't be serialized, since it itself
> extends serializable), so I ended up with the well-known exception:
> org.apache.spark.SparkException: Task not serializable.
> Basically, my original intention was to pass the model as a parameter
>
> because which model we use is dynamic based on what record we are
>
> predicting on.
>
> Has anyone else encountered this? Is this currently being addressed? I
> would expect objects from Spark's own libraries be able to be used
> seamlessly in their applications without these types of exceptions.
>
> 2) The RandomForestClassificationModel.load method appears to hang
> indefinitely when executed from inside a map function (which I assume is
> passed to the executor). So, I basically cannot load a model from a worker.
> We have multiple "profiles" that use differently trained models, which are
> accessed from within a map function to run predictions on different sets of
> data.
>
> The thread that is hanging has this as the latest (most pertinent) code:
> org.apache.spark.ml.util.DefaultParamsReader$.
> loadMetadata(ReadWrite.scala:391)
>
> Looking at the code in github, it appears that it is calling sc.textFile.
> I could not find anything stating that this particular function would not
> work from within a map function.
>
> Are there any suggestions as to how I can get this model to work on a real
> production job (either by allowing it to be serializable and passed around
> or loaded from a worker)?
>
> I've extenisvely POCed this model (saving, loading, transforming,
> training, etc.), however this is the first time I'm attempting to use it
> from within a real application.
>
> Sumona
>
>

Mime
View raw message