predictionio-user mailing list archives

From Donald Szeto <don...@apache.org>
Subject Re: Using Dataframe API vs. RDD API?
Date Sun, 04 Feb 2018 00:06:54 GMT
Hi Shane,

You are correct about Spark ML requiring DataFrame/Dataset because the
generated model is in fact a Transformer, which requires those input types.
Until we finish adding Spark ML support, the workaround would be Daniel’s
suggestion.
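A minimal sketch of working around the Transformer input requirement: wrap the single feature vector in a one-row DataFrame before scoring. Names here (`model`, `spark`, `predictOne`) are illustrative assumptions, not part of any PIO template:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Hypothetical names: `model` is a fitted Spark ML model (a Transformer),
// `spark` is a SparkSession kept alive for serving.
def predictOne(spark: SparkSession,
               model: PipelineModel,
               features: Array[Double]): Double = {
  import spark.implicits._
  // Transformer.transform only accepts a Dataset/DataFrame,
  // so build a one-row DataFrame around the feature vector.
  val df = Seq(Tuple1(Vectors.dense(features))).toDF("features")
  model.transform(df).select("prediction").head.getDouble(0)
}
```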

Regards,
Donald

On Tue, Jan 30, 2018 at 10:53 AM Shane Johnson <shanewaldenjohnson@gmail.com>
wrote:

> I remember this now. Thanks Daniel. Does this confirm that I do indeed
> need to use a spark context when using the new dataframe API (ml vs mllib)?
> I wanted to make sure there wasn't a way to use the new ml library to
> predict without using a dataframe.
>
> *Shane Johnson | 801.360.3350*
> LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
> <https://www.facebook.com/shane.johnson.71653>
>
> 2018-01-30 7:09 GMT-10:00 Daniel O' Shaughnessy <
> danieljamesdavid@gmail.com>:
>
>> Hi Shane,
>>
>> You need to use PAlgorithm instead of P2Algorithm and save/load the spark
>> context accordingly. This way you can use spark context in the predict
>> function.
>>
>> There are examples of using PAlgorithm on the PredictionIO site. It’s
>> slightly more complicated, but not too bad!
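A rough sketch of the pattern Daniel describes, assuming hypothetical names: with PAlgorithm, the model can extend PersistentModel so a SparkContext is available again when the model is loaded for serving:

```scala
import org.apache.predictionio.controller.{Params, PersistentModel, PersistentModelLoader}
import org.apache.spark.SparkContext

// Hypothetical params class for illustration only.
case class MyAlgoParams() extends Params

// The model persists itself and is reloaded with a live SparkContext,
// so predict-time code can build DataFrames.
class MyModel(val path: String) extends PersistentModel[MyAlgoParams] {
  def save(id: String, params: MyAlgoParams, sc: SparkContext): Boolean = {
    // write the underlying Spark ML model to storage here
    true
  }
}

object MyModel extends PersistentModelLoader[MyAlgoParams, MyModel] {
  def apply(id: String, params: MyAlgoParams, sc: Option[SparkContext]): MyModel = {
    // sc is available at load time, unlike with P2Algorithm
    new MyModel(s"/models/$id")
  }
}
```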
>>
>>
>> On Tue, 30 Jan 2018 at 17:06, Shane Johnson <shanewaldenjohnson@gmail.com>
>> wrote:
>>
>>> Thanks team! We are close to having our models working with the
>>> Dataframe API. One additional roadblock we are hitting is the fundamental
>>> difference between the RDD-based API and the Dataframe API. It seems the
>>> old mllib API would accept a simple vector for predictions, whereas the
>>> new ml API requires a dataframe. This presents a challenge, as the predict
>>> function in PredictionIO does not have a spark context.
>>>
>>> Any ideas how to overcome this? Am I thinking through this correctly or
>>> are there other ways to get predictions with the new ml Dataframe API
>>> without having a dataframe as input?
>>>
>>> Best,
>>>
>>> Shane
>>>
>>> *Shane Johnson | 801.360.3350*
>>> LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
>>> <https://www.facebook.com/shane.johnson.71653>
>>>
>>> 2018-01-08 20:37 GMT-10:00 Donald Szeto <donald@apache.org>:
>>>
>>>> We do have work-in-progress for DataFrame API tracked at
>>>> https://issues.apache.org/jira/browse/PIO-71.
>>>>
>>>> Chan, it would be nice if you could create a branch on your personal
>>>> fork if you want to hand it off to someone else. Thanks!
>>>>
>>>> On Fri, Jan 5, 2018 at 2:02 PM, Pat Ferrel <pat@occamsmachete.com>
>>>> wrote:
>>>>
>>>>> Yes and I do not recommend that because the EventServer schema is not
>>>>> a developer contract. It may change at any time. Use the conversion method
>>>>> and go through the PIO API to get the RDD then convert to DF for now.
>>>>>
>>>>> I’m not sure what PIO uses to get an RDD from Postgres, but if they do
>>>>> not use something like the lib you mention, a PR would be nice. Also, if
>>>>> you have an interest in adding the DF APIs to the EventServer,
>>>>> contributions are encouraged. Committers will give some guidance I’m
>>>>> sure, once they know more than me on the subject.
>>>>>
>>>>> If you want to donate some DF code, create a Jira and we’ll easily
>>>>> find a mentor to make suggestions. There are many benefits to this,
>>>>> including not having to support a fork of PIO through subsequent
>>>>> versions. Others are interested in this too.
>>>>>
>>>>>
>>>>>
>>>>> On Jan 5, 2018, at 7:39 AM, Daniel O' Shaughnessy <
>>>>> danieljamesdavid@gmail.com> wrote:
>>>>>
>>>>> ....Should have mentioned that I used org.apache.spark.rdd.JdbcRDD to
>>>>> read in the RDD from a postgres DB initially.
>>>>>
>>>>> This way you don't need to use an EventServer!
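For reference, a sketch of the JdbcRDD approach Daniel mentions; the connection string, table, and column names are placeholders:

```scala
import java.sql.DriverManager
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

// Hypothetical connection settings and schema; adjust to your database.
def loadEvents(sc: SparkContext): JdbcRDD[(Int, Double)] = {
  new JdbcRDD(
    sc,
    () => DriverManager.getConnection(
      "jdbc:postgresql://localhost/pio", "user", "pass"),
    // The two ?s are filled with partition bounds by JdbcRDD.
    "SELECT id, value FROM events WHERE id >= ? AND id <= ?",
    lowerBound = 1L,
    upperBound = 1000000L,
    numPartitions = 10,
    mapRow = r => (r.getInt("id"), r.getDouble("value"))
  )
}
```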
>>>>>
>>>>> On Fri, 5 Jan 2018 at 15:37 Daniel O' Shaughnessy <
>>>>> danieljamesdavid@gmail.com> wrote:
>>>>>
>>>>>> Hi Shane,
>>>>>>
>>>>>> I've successfully used :
>>>>>>
>>>>>> import org.apache.spark.ml.classification.{
>>>>>> RandomForestClassificationModel, RandomForestClassifier }
>>>>>>
>>>>>> with pio. You can access feature importance through the
>>>>>> RandomForestClassifier also.
>>>>>>
>>>>>> Very simple to convert RDDs to DFs as Pat mentioned, something like:
>>>>>>
>>>>>> val RDD_2_DF = sqlContext.createDataFrame(yourRDD).toDF("col1",
>>>>>> "col2")
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, 4 Jan 2018 at 23:10 Pat Ferrel <pat@occamsmachete.com> wrote:
>>>>>>
>>>>>>> Actually there are libs that will read DFs from HBase:
>>>>>>> https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html
>>>>>>>
>>>>>>> This is out of band with PIO and should not be used IMO because the
>>>>>>> schema of the EventStore is not guaranteed to remain as-is. The safest
>>>>>>> way is to translate or get DFs integrated into PIO. I think there is an
>>>>>>> existing Jira that requests Spark ML support, which assumes DFs.
>>>>>>>
>>>>>>>
>>>>>>> On Jan 4, 2018, at 12:25 PM, Pat Ferrel <pat@occamsmachete.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Funny you should ask this. Yes, we are working on a DF based
>>>>>>> Universal Recommender, but you have to convert the RDD into a DF since
>>>>>>> PIO does not read out data in the form of a DF (yet). This is a fairly
>>>>>>> simple step of maybe one line of code, but would be better supported
>>>>>>> in PIO itself. The issue is that the EventStore uses libs that may not
>>>>>>> read out DFs, only RDDs. This is certainly the case with Elasticsearch,
>>>>>>> which provides an RDD lib. I haven’t seen one from them that reads out
>>>>>>> DFs, though it would make a lot of sense for ES especially.
>>>>>>>
>>>>>>> So TLDR; yes, just convert the RDD into a DF for now.
>>>>>>>
>>>>>>> Also please add a feature request as a PIO Jira ticket to look into
>>>>>>> this. I for one would +1.
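The one-line conversion Pat refers to can look like the following, assuming a SparkSession named `spark` and an RDD of a case class (all names here are illustrative):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical event type; column names come from the field names.
case class Rating(user: String, item: String, score: Double)

def toDataFrame(spark: SparkSession, ratings: RDD[Rating]): DataFrame = {
  import spark.implicits._
  // The actual conversion is this single call.
  ratings.toDF()
}
```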
>>>>>>>
>>>>>>>
>>>>>>> On Jan 4, 2018, at 11:55 AM, Shane Johnson <
>>>>>>> shanewaldenjohnson@gmail.com> wrote:
>>>>>>>
>>>>>>> Hello group, Happy new year! Does anyone have a working example or
>>>>>>> template using the DataFrame API vs. the RDD-based APIs? We want to
>>>>>>> migrate to the new DataFrame APIs to take advantage of the *Feature
>>>>>>> Importance* function for our Regression Random Forest Models.
>>>>>>>
>>>>>>> We want to move from
>>>>>>>
>>>>>>> import org.apache.spark.mllib.tree.RandomForest
>>>>>>> import org.apache.spark.mllib.tree.model.RandomForestModel
>>>>>>> import org.apache.spark.mllib.util.MLUtils
>>>>>>>
>>>>>>> to
>>>>>>>
>>>>>>> import org.apache.spark.ml.regression.{RandomForestRegressionModel,
>>>>>>> RandomForestRegressor}
>>>>>>>
>>>>>>>
>>>>>>> Is this something that should be fairly straightforward by adjusting
>>>>>>> parameters and calling new classes within DASE, or is it much more
>>>>>>> involved development?
>>>>>>>
>>>>>>> Thank You!
>>>>>>>
>>>>>>> *Shane Johnson | 801.360.3350*
>>>>>>> LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
>>>>>>> <https://www.facebook.com/shane.johnson.71653>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>
