predictionio-user mailing list archives

From Shane Johnson <shanewaldenjohn...@gmail.com>
Subject Re: Using Dataframe API vs. RDD API?
Date Tue, 30 Jan 2018 18:53:37 GMT
I remember this now. Thanks Daniel. Does this confirm that I do indeed need
to use a spark context when using the new dataframe API (ml vs mllib)? I
wanted to make sure there wasn't a way to use the new ml library to predict
without using a dataframe.
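
For context, here is a minimal sketch of scoring a single query with the DataFrame-based API. It assumes Spark 2.x and a `SparkSession` that is reachable at predict time. `predictOne` is a hypothetical helper name, not part of Spark or PredictionIO. It wraps one feature vector in a one-row DataFrame and reads the prediction back after `transform`:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.regression.RandomForestRegressionModel
import org.apache.spark.sql.SparkSession

// Hypothetical helper (not a Spark or PredictionIO API): score one feature
// vector with an already fitted spark.ml random forest regression model.
def predictOne(spark: SparkSession,
               model: RandomForestRegressionModel,
               features: Vector): Double = {
  // Wrap the vector in a one-row DataFrame, using the column name the model expects.
  val query = spark.createDataFrame(Seq(Tuple1(features))).toDF(model.getFeaturesCol)
  // transform() appends the prediction column; read it from the single row.
  model.transform(query).select(model.getPredictionCol).head().getDouble(0)
}
```

Building a DataFrame per query adds overhead to each request, so this is only a sketch of the mechanics. Later Spark releases may expose a per-vector `predict` directly on the model; check the API docs for the version in use.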

*Shane Johnson | 801.360.3350*
LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
<https://www.facebook.com/shane.johnson.71653>

2018-01-30 7:09 GMT-10:00 Daniel O' Shaughnessy <danieljamesdavid@gmail.com>:

> Hi Shane,
>
> You need to use PAlgorithm instead of P2Algorithm and save/load the spark
> context accordingly. This way you can use spark context in the predict
> function.
>
> There are examples of using PAlgorithm on the predictionio Site. It’s
> slightly more complicated but not too bad!
>
>
> On Tue, 30 Jan 2018 at 17:06, Shane Johnson <shanewaldenjohnson@gmail.com>
> wrote:
>
>> Thanks team! We are close to having our models working with the Dataframe
>> API. One additional roadblock we are hitting is the fundamental difference
>> in the RDD based API vs the Dataframe API. It seems that the old mllib API
>> would allow a simple vector to get predictions where in the new ml API a
>> dataframe is required. This presents a challenge as the predict function in
>> PredictionIO does not have a spark context.
>>
>> Any ideas how to overcome this? Am I thinking through this correctly or
>> are there other ways to get predictions with the new ml Dataframe API
>> without having a dataframe as input?
>>
>> Best,
>>
>> Shane
>>
>> *Shane Johnson | 801.360.3350*
>> LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
>> <https://www.facebook.com/shane.johnson.71653>
>>
>> 2018-01-08 20:37 GMT-10:00 Donald Szeto <donald@apache.org>:
>>
>>> We do have work-in-progress for DataFrame API tracked at
>>> https://issues.apache.org/jira/browse/PIO-71.
>>>
>>> Chan, it would be nice if you could create a branch on your personal
>>> fork if you want to hand it off to someone else. Thanks!
>>>
>>> On Fri, Jan 5, 2018 at 2:02 PM, Pat Ferrel <pat@occamsmachete.com>
>>> wrote:
>>>
>>>> Yes and I do not recommend that because the EventServer schema is not a
>>>> developer contract. It may change at any time. Use the conversion method
>>>> and go through the PIO API to get the RDD then convert to DF for now.
>>>>
>>>> I’m not sure what PIO uses to get an RDD from Postgres but if they do
>>>> not use something like the lib you mention, a PR would be nice. Also, if you
>>>> have an interest in adding the DF APIs to the EventServer, contributions are
>>>> encouraged. Committers will give some guidance, I’m sure: ones that know more
>>>> than me on the subject.
>>>>
>>>> If you want to donate some DF code, create a Jira and we’ll easily find
>>>> a mentor to make suggestions. There are many benefits to this including not
>>>> having to support a fork of PIO through subsequent versions. Also others
>>>> are interested in this too.
>>>>
>>>>
>>>>
>>>> On Jan 5, 2018, at 7:39 AM, Daniel O' Shaughnessy <
>>>> danieljamesdavid@gmail.com> wrote:
>>>>
>>>> ....Should have mentioned that I used org.apache.spark.rdd.JdbcRDD to
>>>> read in the RDD from a postgres DB initially.
>>>>
>>>> This way you don't need to use an EventServer!
>>>>
>>>> On Fri, 5 Jan 2018 at 15:37 Daniel O' Shaughnessy <
>>>> danieljamesdavid@gmail.com> wrote:
>>>>
>>>>> Hi Shane,
>>>>>
>>>>> I've successfully used:
>>>>>
>>>>> import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
>>>>>
>>>>> with pio. You can access feature importance through the
>>>>> RandomForestClassifier also.
>>>>>
>>>>> Very simple to convert RDDs to DFs as Pat mentioned, something like:
>>>>>
>>>>> val RDD_2_DF = sqlContext.createDataFrame(yourRDD).toDF("col1", "col2")
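
The one-line conversion above can be expanded into a slightly fuller sketch. This assumes Spark 2.x, where `SparkSession` plays the role of `sqlContext`, and an RDD of `(label, feature values)` pairs. `toTrainingDF` and the column names `label` and `features` are hypothetical choices for illustration, not anything PredictionIO prescribes:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: turn an RDD of (label, raw feature values) into the
// two-column layout (Double label, Vector features) that spark.ml estimators read.
def toTrainingDF(spark: SparkSession, rows: RDD[(Double, Array[Double])]): DataFrame = {
  // spark.ml works on ml.linalg vectors, so pack each array into a dense Vector.
  val withVectors = rows.map { case (label, values) => (label, Vectors.dense(values)) }
  // createDataFrame infers the schema from the tuple type; toDF names the columns.
  spark.createDataFrame(withVectors).toDF("label", "features")
}
```

With an RDD obtained through the PIO API, the result of a call like `toTrainingDF(spark, yourRDD)` could then be passed to a spark.ml estimator's `fit`.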
>>>>>
>>>>>
>>>>>
>>>>> On Thu, 4 Jan 2018 at 23:10 Pat Ferrel <pat@occamsmachete.com> wrote:
>>>>>
>>>>>> Actually there are libs that will read DFs from HBase
>>>>>> https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html
>>>>>>
>>>>>> This is out of band with PIO and should not be used IMO because the
>>>>>> schema of the EventStore is not guaranteed to remain as-is. The safest way
>>>>>> is to translate or get DFs integrated into PIO. I think there is an existing
>>>>>> Jira that requests Spark ML support, which assumes DFs.
>>>>>>
>>>>>>
>>>>>> On Jan 4, 2018, at 12:25 PM, Pat Ferrel <pat@occamsmachete.com>
>>>>>> wrote:
>>>>>>
>>>>>> Funny you should ask this. Yes, we are working on a DF-based
>>>>>> Universal Recommender, but you have to convert the RDD into a DF since PIO
>>>>>> does not read out data in the form of a DF (yet). This is a fairly simple
>>>>>> step of maybe one line of code but would be better supported in PIO itself.
>>>>>> The issue is that the EventStore uses libs that may not read out DFs, but
>>>>>> RDDs. This is certainly the case with Elasticsearch, which provides an RDD
>>>>>> lib. I haven’t seen one from them that reads out DFs, though it would make a
>>>>>> lot of sense for ES especially.
>>>>>>
>>>>>> So TLDR; yes, just convert the RDD into a DF for now.
>>>>>>
>>>>>> Also please add a feature request as a PIO Jira ticket to look into
>>>>>> this. I for one would +1
>>>>>>
>>>>>>
>>>>>> On Jan 4, 2018, at 11:55 AM, Shane Johnson <
>>>>>> shanewaldenjohnson@gmail.com> wrote:
>>>>>>
>>>>>> Hello group, Happy new year! Does anyone have a working example or
>>>>>> template using the DataFrame API vs. the RDD-based APIs? We want to
>>>>>> migrate to the new DataFrame APIs to take advantage of the *Feature
>>>>>> Importance* function for our Regression Random Forest Models.
>>>>>>
>>>>>> We are wanting to move from
>>>>>>
>>>>>> import org.apache.spark.mllib.tree.RandomForest
>>>>>> import org.apache.spark.mllib.tree.model.RandomForestModel
>>>>>> import org.apache.spark.mllib.util.MLUtils
>>>>>>
>>>>>> to
>>>>>>
>>>>>> import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
>>>>>>
>>>>>>
>>>>>> Is this something that should be fairly straightforward by adjusting
>>>>>> parameters and calling new classes within DASE, or is it much more involved
>>>>>> development?
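
As a rough illustration of the target API, the sketch below fits a `RandomForestRegressor` and reads its feature importances. It assumes a training DataFrame with a Double `label` column and a Vector `features` column. `trainForest` and `importances` are hypothetical helper names, and wiring this into a DASE `train` method is not shown:

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: fit a regression forest on a DataFrame that has a
// Double "label" column and a Vector "features" column (assumed names).
def trainForest(training: DataFrame, numTrees: Int): RandomForestRegressionModel =
  new RandomForestRegressor()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setNumTrees(numTrees)
    .setSeed(42L) // fixed seed only to make the sketch reproducible
    .fit(training)

// featureImportances holds one weight per input feature, in feature order.
def importances(model: RandomForestRegressionModel): Vector =
  model.featureImportances
```

The importances come from the fitted model rather than the estimator, so they are only available after `fit` has run.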
>>>>>>
>>>>>> Thank You!
>>>>>>
>>>>>> *Shane Johnson | 801.360.3350*
>>>>>> LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
>>>>>> <https://www.facebook.com/shane.johnson.71653>
>>>>>>
