predictionio-user mailing list archives

From Marcin Ziemiński <ziem...@gmail.com>
Subject Re: How to access Spark Context in predict?
Date Tue, 27 Sep 2016 15:30:11 GMT
Hi Hasan,

In the case of the third point, injecting a dummy RDD may cause the original
errors when it is serialized, as you described at the beginning. If you use
PAlgorithm, take a look at PersistentModelLoader. Implementing
PersistentModel together with PersistentModelLoader should give you
access to the SparkContext. As you can see, PersistentModelLoader is provided
with an Option[SparkContext] in its apply method, which in this case will be
Some(sc).
Your model can be something that encapsulates both the RandomForestModel and
the SparkContext. With this you would save only the RandomForestModel, but
during model loading you can put the given SparkContext inside your model
and later retrieve it in the predict method.

class MyModel(val rfm: RandomForestModel, val sc: SparkContext)
  extends PersistentModel[MyParams] {

  def save(id: String, params: MyParams, sc: SparkContext): Boolean = {
    // Save only the MLlib model; the SparkContext is re-injected by the
    // loader on deployment. The path below is just an example.
    rfm.save(sc, s"/tmp/${id}/rfm")
    true
  }
}

object MyModel extends PersistentModelLoader[MyParams, MyModel] {
  def apply(id: String, params: MyParams, sc: Option[SparkContext]): MyModel = {
    // Use the provided sc to recreate your model and keep it for later
    // use in the predict method. PredictionIO passes Some(sc) here.
    val sparkContext = sc.get
    new MyModel(RandomForestModel.load(sparkContext, s"/tmp/${id}/rfm"), sparkContext)
  }
}
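
With this, a minimal sketch of the predict method (Query, PredictedResult,
and the toFeatures helper are placeholders for your engine's own types):

def predict(model: MyModel, query: Query): PredictedResult = {
  // The SparkContext injected by the loader is at hand here for
  // PEventStore queries or other Spark work, while the forest itself
  // predicts locally.
  val sc = model.sc
  val score = model.rfm.predict(toFeatures(query))
  PredictedResult(score)
}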

I hope this helps.

Regards,
Marcin


On Tue, 27 Sep 2016 at 13:04, Hasan Can Saral <hasancansaral@gmail.com>
wrote:

> Hi Kenneth & Donald,
>
> That was really clarifying, thank you. I really appreciate it. So now I
> know that;
>
> 1- I should use LEventStore to query HBase without sc, with as little
> processing as possible in predict,
> 2- In this case I don't have to extend PersistentModel, since I will not
> need sc in predict.
> 3- If I need sc and batch processing in predict, I can save the RandomForest
> trees to a file and then load them from there. As far as I can see, my only
> option to access sc for PEventStore is to add a dummy RDD to the model and
> use dummyRDD.context.
>
> Am I correct, especially in the 3rd point?
>
> Thank you again,
> Hasan
>
> On Tue, Sep 27, 2016 at 9:00 AM, Kenneth Chan <kenneth@apache.org> wrote:
>
>> Hasan,
>>
>> Spark's RandomForest model doesn't need an RDD. It is much simpler to
>> serialize it and use it in local memory in predict().
>> See the example here:
>>
>> https://github.com/PredictionIO/template-scala-parallel-leadscoring/blob/develop/src/main/scala/RFAlgorithm.scala
>>
>> For accessing the event store in predict(), you should use the LEventStore
>> API (not the PEventStore API) to have fast queries for specific events.
>>
>> (Use the PEventStore API only if you really want to do batch processing
>> again in predict() and need an RDD for it.)
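>>
>> A minimal sketch of such an LEventStore lookup in predict() (the app name,
>> entity type, and limit are just example values):
>>
>> val events = try {
>>   LEventStore.findByEntity(
>>     appName = "MyApp",
>>     entityType = "user",
>>     entityId = query.entityId.getOrElse(""),
>>     limit = Some(10),
>>     latest = true
>>   ).toList
>> } catch {
>>   case e: scala.concurrent.TimeoutException =>
>>     // findByEntity blocks with a timeout; decide how to degrade here
>>     List.empty
>> }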
>>
>>
>> Kenneth
>>
>>
>> On Mon, Sep 26, 2016 at 9:19 PM, Donald Szeto <donald@apache.org> wrote:
>>
>>> Hi Hasan,
>>>
>>> Does your randomForestModel contain any RDD?
>>>
>>> If so, implement your algorithm by extending PAlgorithm, have your model
>>> extend PersistentModel, and implement PersistentModelLoader to save and
>>> load your model. You will be able to perform RDD operations within
>>> predict() by using the model's RDD.
>>>
>>> If not, implement your algorithm by extending P2LAlgorithm, and see if
>>> PredictionIO can automatically persist the model for you. The convention
>>> assumes that a non-RDD model does not require Spark to perform any RDD
>>> operations, so there will be no SparkContext access.
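>>>
>>> For reference, a bare P2LAlgorithm skeleton (the type names are
>>> placeholders for your engine's own classes):
>>>
>>> class MyAlgorithm(ap: MyParams)
>>>   extends P2LAlgorithm[PreparedData, MyModel, Query, PredictedResult] {
>>>   // A model that holds no RDD can be serialized and persisted
>>>   // automatically by PredictionIO.
>>>   def train(sc: SparkContext, data: PreparedData): MyModel = ???
>>>   def predict(model: MyModel, query: Query): PredictedResult = ???
>>> }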
>>>
>>> Are these conventions not fitting your use case? Feedback is always
>>> welcome for improving PredictionIO.
>>>
>>> Regards,
>>> Donald
>>>
>>>
>>> On Mon, Sep 26, 2016 at 9:05 AM, Hasan Can Saral <
>>> hasancansaral@gmail.com> wrote:
>>>
>>>> Hi Marcin,
>>>>
>>>> I did look at the definition of PersistentModel, and indeed replaced
>>>> LocalFileSystemPersistentModel with PersistentModel. Thank you for this, I
>>>> really appreciate your help.
>>>>
>>>> However, I am having quite a hard time understanding how I can access the
>>>> sc object that PredictionIO provides to the save and apply methods from
>>>> within the predict method.
>>>>
>>>> class SomeModel(randomForestModel: RandomForestModel, dummyRDD: RDD[_])
>>>>   extends PersistentModel[SomeAlgorithmParams] {
>>>>
>>>>   override def save(id: String, params: SomeAlgorithmParams, sc: SparkContext): Boolean = {
>>>>     // Here I should save randomForestModel to a file, but how to?
>>>>     // Tried saveAsObjectFile but no luck.
>>>>     true
>>>>   }
>>>> }
>>>>
>>>> object SomeModel extends PersistentModelLoader[SomeAlgorithmParams, SomeModel] {
>>>>   override def apply(id: String, params: SomeAlgorithmParams, sc: Option[SparkContext]): SomeModel = {
>>>>     // Here should I load randomForestModel from file? How?
>>>>     new SomeModel(randomForestModel)
>>>>   }
>>>> }
>>>>
>>>> So, my questions have become:
>>>> 1- Can I save randomForestModel? If yes, how? If I cannot, I will have
>>>> to return false and retrain upon deployment. How do I skip pio train in
>>>> this case?
> 2- How do I load the saved randomForestModel from file? If I cannot, will I
> remove object SomeModel extends PersistentModelLoader altogether?
>>>> 3- How do I access sc within predict? Do I save a dummy RDD, load it in
>>>> apply, and say .context? In this case what happens to randomForestModel?
>>>>
> I am really quite confused and would really appreciate some help or sample
> code if you have time.
>>>> Thank you.
>>>> Hasan
>>>>
>>>>
>>>> On Mon, Sep 26, 2016 at 2:56 PM, Marcin Ziemiński <zieminm@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Hasan,
>>>>>
>>>>> So I guess there are two things here:
>>>>> 1. You need SparkContext for predictions
>>>>> 2. You also need to retrain your model during loading
>>>>>
>>>>> Please look at the definition of PersistentModel and the comments
>>>>> attached:
>>>>>
>>>>> trait PersistentModel[AP <: Params] {
>>>>>   /** Save the model to some persistent storage.
>>>>>     *
>>>>>     * This method should return true if the model has been saved successfully so
>>>>>     * that PredictionIO knows that it can be restored later during deployment.
>>>>>     * This method should return false if the model cannot be saved (or should
>>>>>     * not be saved due to configuration) so that PredictionIO will re-train the
>>>>>     * model during deployment. All arguments of this method are provided
>>>>>     * automatically by PredictionIO.
>>>>>     *
>>>>>     * @param id ID of the run that trained this model.
>>>>>     * @param params Algorithm parameters that were used to train this model.
>>>>>     * @param sc An Apache Spark context.
>>>>>     */
>>>>>   def save(id: String, params: AP, sc: SparkContext): Boolean
>>>>> }
>>>>>
>>>>> In order to achieve the desired result you could simply use
>>>>> PersistentModel instead of LocalFileSystemPersistentModel and return false
>>>>> from save. Then during deployment your model will be retrained through your
>>>>> Algorithm implementation. You shouldn't need to retrain your model in
>>>>> implementations of PersistentModelLoader - this is rather for loading
>>>>> models that are already trained and stored somewhere.
>>>>> You can save the SparkContext instance provided to the train method for
>>>>> usage in predict(...) (assuming that your algorithm is an instance of
>>>>> PAlgorithm or P2LAlgorithm). Thus you should have what you need.
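>>>>>
>>>>> A rough sketch of keeping sc from train (the type names are placeholders
>>>>> for your engine's own classes):
>>>>>
>>>>> class MyAlgorithm(ap: MyParams)
>>>>>   extends PAlgorithm[PreparedData, MyModel, Query, PredictedResult] {
>>>>>
>>>>>   def train(sc: SparkContext, data: PreparedData): MyModel = {
>>>>>     // Keep a reference to sc inside the model so that predict
>>>>>     // can reach it later within the same deployment.
>>>>>     new MyModel(trainForest(sc, data), sc) // trainForest is hypothetical
>>>>>   }
>>>>>
>>>>>   def predict(model: MyModel, query: Query): PredictedResult = {
>>>>>     val sc = model.sc // stored by train above
>>>>>     // use sc for PEventStore queries etc.
>>>>>     ???
>>>>>   }
>>>>> }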
>>>>>
>>>>> Regards,
>>>>> Marcin
>>>>>
>>>>>
>>>>>
>>>>> On Fri, 23 Sep 2016 at 17:46, Hasan Can Saral <
>>>>> hasancansaral@gmail.com> wrote:
>>>>>
>>>>>> Hi Marcin!
>>>>>>
>>>>>> Thank you for your answer.
>>>>>>
>>>>>> I only need the SparkContext, but have no idea:
>>>>>> 1- How do I retrieve it from PersistentModelLoader?
>>>>>> 2- How do I access sc in the predict method using the configuration
>>>>>> below?
>>>>>>
>>>>>> class SomeModel() extends LocalFileSystemPersistentModel[SomeAlgorithmParams] {
>>>>>>   override def save(id: String, params: SomeAlgorithmParams, sc: SparkContext): Boolean = {
>>>>>>     false
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> object SomeModel extends LocalFileSystemPersistentModelLoader[SomeAlgorithmParams, SomeModel] {
>>>>>>   override def apply(id: String, params: SomeAlgorithmParams, sc: Option[SparkContext]): SomeModel = {
>>>>>>     new SomeModel() // HERE I TRAIN AND RETURN THE TRAINED MODEL
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> Thank you very much, I really appreciate it!
>>>>>>
>>>>>> Hasan
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 22, 2016 at 7:05 PM, Marcin Ziemiński <zieminm@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Hasan,
>>>>>>>
>>>>>>> I think that your problem comes from using a deserialized RDD, which has
>>>>>>> already lost its connection with the SparkContext.
>>>>>>> Similar case could be found here:
>>>>>>> http://stackoverflow.com/questions/29567247/serializing-rdd
>>>>>>>
>>>>>>> If you really only need the SparkContext, you could probably use the one
>>>>>>> provided to PersistentModelLoader, which would be implemented by your
>>>>>>> model. Alternatively, you could implement PersistentModel to return
>>>>>>> false from the save method. In this case your algorithm would be
>>>>>>> retrained on deploy, which would also provide you with an instance of
>>>>>>> SparkContext.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Marcin
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 22 Sep 2016 at 13:34, Hasan Can Saral <
>>>>>>> hasancansaral@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> I am trying to query the Event Server with the PEventStore API in the
>>>>>>>> predict method to fetch events per entity to create my features.
>>>>>>>> PEventStore needs sc, and for this, I have:
>>>>>>>>
>>>>>>>> - Extended PAlgorithm
>>>>>>>> - Extended LocalFileSystemPersistentModel and
>>>>>>>> LocalFileSystemPersistentModelLoader
>>>>>>>> - Put a dummy emptyRDD into my model
>>>>>>>> - Tried to access sc with model.dummyRDD.context to receive this
>>>>>>>> error:
>>>>>>>>
>>>>>>>> org.apache.spark.SparkException: RDD transformations and actions can
>>>>>>>> only be invoked by the driver, not inside of other transformations; for
>>>>>>>> example, rdd1.map(x => rdd2.values.count() * x) is invalid because the
>>>>>>>> values transformation and count action cannot be performed inside of the
>>>>>>>> rdd1.map transformation. For more information, see SPARK-5063.
>>>>>>>>
>>>>>>>> Just like this user got it here
>>>>>>>> <https://groups.google.com/forum/#!topic/predictionio-user/h4kIltGIIYE>
>>>>>>>> in the predictionio-user group. Any suggestions?
>>>>>>>>
>>>>>>>> Here's more of my predict method:
>>>>>>>>
>>>>>>>> def predict(model: SomeModel, query: Query): PredictedResult = {
>>>>>>>>
>>>>>>>>   val appName = sys.env.getOrElse[String]("APP_NAME", ap.appName)
>>>>>>>>
>>>>>>>>   val previousEvents = try {
>>>>>>>>     PEventStore.find(
>>>>>>>>       appName = appName,
>>>>>>>>       entityType = Some(ap.entityType),
>>>>>>>>       entityId = Some(query.entityId.getOrElse(""))
>>>>>>>>     )(model.dummyRDD.context).map(event => {
>>>>>>>>       Try(new CustomEvent(
>>>>>>>>         Some(event.event),
>>>>>>>>         Some(event.entityType),
>>>>>>>>         Some(event.entityId),
>>>>>>>>         Some(event.eventTime),
>>>>>>>>         Some(event.creationTime),
>>>>>>>>         Some(new Properties(
>>>>>>>>           ...
>>>>>>>>         ))
>>>>>>>>       ))
>>>>>>>>     }).filter(_.isSuccess).map(_.get)
>>>>>>>>   } catch {
>>>>>>>>     case e: Exception => // fatal because of error, an empty query
>>>>>>>>       logger.error(s"Error when reading events: ${e}")
>>>>>>>>       throw e
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   ...
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Hasan Can Saral
>>>>>> hasancansaral@gmail.com
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Hasan Can Saral
>>>> hasancansaral@gmail.com
>>>>
>>>
>>>
>>
>
>
> --
>
> Hasan Can Saral
> hasancansaral@gmail.com
>
