predictionio-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Using Dataframe API vs. RDD API?
Date Thu, 04 Jan 2018 23:10:32 GMT
Actually there are libs that will read DFs from HBase https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html

This is out of band with PIO and should not be used IMO, because the schema of the EventStore
is not guaranteed to remain as-is. The safest way is to translate the RDDs yourself or to get
DF support integrated into PIO. I think there is an existing Jira that requests Spark ML
support, which assumes DFs.


On Jan 4, 2018, at 12:25 PM, Pat Ferrel <pat@occamsmachete.com> wrote:

Funny you should ask this. Yes, we are working on a DF-based Universal Recommender, but you
have to convert the RDD into a DF, since PIO does not read out data in the form of a DF (yet).
This is a fairly simple step of maybe one line of code, but it would be better supported in PIO
itself. The issue is that the EventStore uses libs that may read out RDDs rather than DFs. This
is certainly the case with Elasticsearch, which provides an RDD lib. I haven’t seen one
from them that reads out DFs, though it would make a lot of sense, for ES especially.

So TL;DR: yes, just convert the RDD into a DF for now.
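
For anyone reading this in the archive, the conversion mentioned above might look roughly like
the sketch below. It is only an illustration under assumptions: the Rating case class and the
RddToDf object are hypothetical names, not part of PIO or any template, and a real engine would
use its own training-data class. Inside a PIO template you would normally reuse the existing
SparkContext rather than build a local session.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical record type standing in for whatever the template's RDD holds.
case class Rating(user: String, item: String, rating: Double)

object RddToDf {
  // The "one line": Spark derives the schema from the case class fields.
  def toDF(spark: SparkSession, rdd: RDD[Rating]): DataFrame =
    spark.createDataFrame(rdd)

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only. With an existing SparkContext `sc`,
    // SparkSession.builder().config(sc.getConf).getOrCreate() reuses it.
    val spark = SparkSession.builder()
      .appName("rdd-to-df-sketch")
      .master("local[*]")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(Seq(
      Rating("u1", "i1", 4.0),
      Rating("u2", "i1", 3.0)
    ))

    val df = toDF(spark, rdd)
    df.printSchema() // columns: user, item, rating

    spark.stop()
  }
}
```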

Also, please add a feature request as a PIO Jira ticket to look into this. I for one would
+1 it.


On Jan 4, 2018, at 11:55 AM, Shane Johnson <shanewaldenjohnson@gmail.com> wrote:

Hello group, Happy new year! Does anyone have a working example or template using the DataFrame
API instead of the RDD-based APIs? We want to migrate to the new DataFrame APIs to take
advantage of the Feature Importance function for our Regression Random Forest Models.

We want to move from

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
to
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}

Is this something that should be fairly straightforward, just adjusting parameters and calling
the new classes within DASE, or is it a much more involved development effort?
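
As a rough illustration of the target API (not a PIO template), the sketch below trains a
RandomForestRegressor on a DataFrame and reads featureImportances from the fitted model. The
column names x1, x2 and label, the toy data, and the RfImportanceSketch object are assumptions
made up for the example. How this fits into the DASE components is not shown here.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
import org.apache.spark.sql.{DataFrame, SparkSession}

object RfImportanceSketch {
  // Hypothetical feature column names; a real dataset defines its own.
  val featureCols: Array[String] = Array("x1", "x2")

  def train(df: DataFrame): RandomForestRegressionModel = {
    // spark.ml estimators expect a single vector column of features.
    val assembled = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
      .transform(df)

    new RandomForestRegressor()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(10)
      .setSeed(42L)
      .fit(assembled)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rf-importance-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data for demonstration only.
    val df = Seq(
      (1.0, 0.5, 2.0),
      (2.0, 0.1, 4.1),
      (3.0, 0.9, 6.2),
      (4.0, 0.3, 7.9)
    ).toDF("x1", "x2", "label")

    val model = train(df)

    // One importance weight per input feature, in featureCols order.
    featureCols.zip(model.featureImportances.toArray).foreach {
      case (name, weight) => println(s"$name: $weight")
    }

    spark.stop()
  }
}
```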

Thank You!
Shane Johnson | 801.360.3350


