System Architecture and Dependencies

During the installation, you have installed the following software:

  • Apache Hadoop 2.4.0 (required only if YARN and HDFS are needed)
  • Apache HBase 0.98.6
  • Apache Spark 1.2.0 for Hadoop 2.4
  • Elasticsearch 1.4.0

This section explains how they are used in PredictionIO.

PredictionIO Systems

HBase: Event Server uses Apache HBase as the data store. It stores imported events. If you are not using the PredictionIO Event Server, you do not need to install HBase.
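
For illustration, here is how one event might be imported into a running Event Server (a hedged sketch: 7070 is the default Event Server port, while the access key, entity ID, and property values are placeholders for your own setup):

$ curl -i -X POST "http://localhost:7070/events.json?accessKey=YOUR_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "event": "$set",
    "entityType": "user",
    "entityId": "u0",
    "properties": { "attr0": 4, "attr1": 3, "attr2": 8, "plan": 1 }
  }'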

Apache Spark: Spark is a large-scale data processing engine that powers the algorithm, training, and serving components.

A Spark algorithm differs from a conventional single-machine algorithm in that it uses Spark's RDD abstraction as its primary data type. The PredictionIO framework natively supports both RDD-based algorithms and traditional single-machine algorithms.
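
For illustration, a minimal standalone Spark sketch (not from the PredictionIO codebase) of the RDD abstraction that RDD-based algorithms build on:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Sketch: an RDD is a distributed collection; transformations and
// aggregations on it run across the Spark cluster, not on one machine.
object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))
    val data: RDD[Double] = sc.parallelize(1 to 1000000).map(_.toDouble)
    val mean = data.sum() / data.count() // computed in parallel
    println(s"mean = $mean")
    sc.stop()
  }
}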

HDFS: The output of training has two parts: a model and its metadata. The model is then stored in HDFS or a local file system.

Elasticsearch: It stores metadata such as model versions, engine versions, access key and app id mappings, evaluation results, etc.

Using Alternative Algorithm

The classification template uses the Naive Bayes algorithm by default. You can easily add and use other MLlib classification algorithms. The following demonstrates how to add the MLlib Random Forests algorithm to the engine.

Create a new file RandomForestAlgorithm.scala

Locate src/main/scala/NaiveBayesAlgorithm.scala under your engine directory, which should be /MyClassification if you are following the Classification Quick Start. Copy NaiveBayesAlgorithm.scala to a new file RandomForestAlgorithm.scala. You will modify this file, following the instructions below, to define a new RandomForestAlgorithm class.
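
For example (assuming your shell is at the engine directory):

$ cd MyClassification
$ cp src/main/scala/NaiveBayesAlgorithm.scala src/main/scala/RandomForestAlgorithm.scala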

Define the algorithm class and parameters

In RandomForestAlgorithm.scala, import the MLlib Random Forest classes by changing the following lines:

Original

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.classification.NaiveBayesModel

Change to:

import org.apache.spark.mllib.tree.RandomForest // CHANGED
import org.apache.spark.mllib.tree.model.RandomForestModel // CHANGED

These classes are necessary to use MLlib's Random Forest algorithm.

Modify the AlgorithmParams class for the Random Forest algorithm:

// CHANGED
case class RandomForestAlgorithmParams(
  numClasses: Int,
  numTrees: Int,
  featureSubsetStrategy: String,
  impurity: String,
  maxDepth: Int,
  maxBins: Int
) extends Params

This class defines the parameters of the Random Forest algorithm (whose values you can later specify in engine.json). Please refer to the MLlib documentation for the description and usage of these parameters.

Modify the NaiveBayesAlgorithm class to RandomForestAlgorithm. The changes are:

  • The new RandomForestAlgorithmParams class is used as the parameter class.
  • RandomForestModel is used as the type parameter. This is the model returned by the Random Forest algorithm.
  • The train() function is modified to return a RandomForestModel instead of a NaiveBayesModel.
  • The predict() function takes a RandomForestModel as input.

// extends P2LAlgorithm because MLlib's RandomForestModel doesn't
// contain RDD.
class RandomForestAlgorithm(val ap: RandomForestAlgorithmParams) // CHANGED
  extends P2LAlgorithm[PreparedData, RandomForestModel, // CHANGED
  Query, PredictedResult] {

  // CHANGED
  def train(sc: SparkContext, data: PreparedData): RandomForestModel = {
    // CHANGED
    // Empty categoricalFeaturesInfo indicates all features are continuous.
    val categoricalFeaturesInfo = Map[Int, Int]()
    RandomForest.trainClassifier(
      data.labeledPoints,
      ap.numClasses,
      categoricalFeaturesInfo,
      ap.numTrees,
      ap.featureSubsetStrategy,
      ap.impurity,
      ap.maxDepth,
      ap.maxBins)
  }

  def predict(
    model: RandomForestModel, // CHANGED
    query: Query): PredictedResult = {

    val label = model.predict(Vectors.dense(
        query.attr0, query.attr1, query.attr2
    ))
    new PredictedResult(label)
  }

}

Note that the MLlib Random Forest algorithm takes the same training data as the Naive Bayes algorithm (i.e., RDD[LabeledPoint]), so you don't need to modify the DataSource, TrainingData, and PreparedData classes. If the new algorithm to be added requires a different type of training data, then you need to modify these classes accordingly to accommodate your new algorithm, as sketched below.
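
For example (a purely hypothetical sketch with an invented field name), an algorithm that consumed raw attribute rows instead of RDD[LabeledPoint] would need a PreparedData along these lines, with DataSource and Preparator adapted to match:

// Hypothetical sketch only: rows of (label, raw feature array)
// instead of RDD[LabeledPoint].
class PreparedData(
  val rows: RDD[(Double, Array[Double])]
) extends Serializable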

Update Engine.scala

Modify the Engine Factory to add the new algorithm class RandomForestAlgorithm you just defined, and give it the name "randomforest". This name will be used in engine.json to specify which algorithm to use.

object ClassificationEngine extends IEngineFactory {
  def apply() = {
    new Engine(
      classOf[DataSource],
      classOf[Preparator],
      Map("naive" -> classOf[NaiveBayesAlgorithm],
        "randomforest" -> classOf[RandomForestAlgorithm]), // ADDED
      classOf[Serving])
  }
}

This engine factory now returns an engine with two algorithms, named "naive" and "randomforest" respectively.

Update engine.json

In order to use the new algorithm, you need to modify engine.json to specify the name of the algorithm and the parameters.

Update the engine.json to use randomforest:

...
"algorithms": [
  {
    "name": "randomforest",
    "params": {
      "numClasses": 3,
      "numTrees": 5,
      "featureSubsetStrategy": "auto",
      "impurity": "gini",
      "maxDepth": 4,
      "maxBins": 100
    }
  }
]
...

The engine now uses the MLlib Random Forests algorithm instead of the default Naive Bayes algorithm. You are ready to build, train, and deploy the engine as described in the Quick Start.

$ pio build
$ pio train
$ pio deploy

To switch back to the Naive Bayes algorithm, simply modify engine.json.
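
For example, restore the algorithms section of engine.json to the Naive Bayes entry (the same form shown on the DASE page):

"algorithms": [
  {
    "name": "naive",
    "params": {
      "lambda": 1.0
    }
  }
]

Then run pio build, pio train, and pio deploy again to apply the change.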

DASE Components Explained (Classification)

PredictionIO's DASE architecture brings the separation-of-concerns design principle to predictive engine development. DASE stands for the following components of an engine:

  • Data - includes Data Source and Data Preparator
  • Algorithm(s)
  • Serving
  • Evaluator

Let's look at the code and see how you can customize the engine you built from the Classification Engine Template.

Evaluator will not be covered in this tutorial. Please see Evaluation Explained for details on using evaluation.

The Engine Design

As you can see from the Quick Start, MyClassification takes a JSON prediction query, e.g. { "attr0":4, "attr1":3, "attr2":8 }, and returns a JSON predicted result.

For versions before v0.3.1, the query is an array of feature values: { "features": [4, 3, 8] }.

In MyClassification/src/main/scala/Engine.scala, the Query case class defines the format of the query, such as { "attr0":4, "attr1":3, "attr2":8 }:

class Query(
  val attr0 : Double,
  val attr1 : Double,
  val attr2 : Double
) extends Serializable

The PredictedResult case class defines the format of the predicted result, such as {"label":2.0}:

class PredictedResult(
  val label: Double
) extends Serializable

Finally, ClassificationEngine is the Engine Factory that defines the components this engine will use: Data Source, Data Preparator, Algorithm(s) and Serving components.

object ClassificationEngine extends IEngineFactory {
  def apply() = {
    new Engine(
      classOf[DataSource],
      classOf[Preparator],
      Map("naive" -> classOf[NaiveBayesAlgorithm]),
      classOf[Serving])
  }
}

Spark MLlib

Spark MLlib's NaiveBayes algorithm takes training data of RDD type, i.e. RDD[LabeledPoint], and trains a model, which is a NaiveBayesModel object.
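
Outside of the engine template, the same MLlib call can be exercised directly — a minimal, self-contained sketch (Spark 1.x API; the toy data is illustrative only):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Train MLlib NaiveBayes directly on a toy RDD[LabeledPoint].
def trainToyModel(sc: SparkContext): NaiveBayesModel = {
  val data: RDD[LabeledPoint] = sc.parallelize(Seq(
    LabeledPoint(0.0, Vectors.dense(4.0, 3.0, 8.0)),
    LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 1.0))
  ))
  NaiveBayes.train(data, lambda = 1.0)
}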

PredictionIO's MLlib Classification engine template, on which MyClassification is based, integrates this algorithm under the DASE architecture. We will take a closer look at the DASE code below.

See the MLlib documentation to learn more about the NaiveBayes algorithm.

Data

In the DASE architecture, data is prepared by two components sequentially: Data Source and Data Preparator. Together they take data from the data store and prepare RDD[LabeledPoint] for the NaiveBayes algorithm.

Data Source

In MyClassification/src/main/scala/DataSource.scala, the readTraining method of class DataSource reads and selects data from the Event Server's data store and returns TrainingData.

case class DataSourceParams(appName: String) extends Params

class DataSource(val dsp: DataSourceParams)
  extends PDataSource[TrainingData, EmptyEvaluationInfo, Query, EmptyActualResult] {

  @transient lazy val logger = Logger[this.type]

  override
  def readTraining(sc: SparkContext): TrainingData = {

    val labeledPoints: RDD[LabeledPoint] = PEventStore.aggregateProperties(
      appName = dsp.appName,
      entityType = "user",
      // only keep entities with these required properties defined
      required = Some(List("plan", "attr0", "attr1", "attr2")))(sc)
      // aggregateProperties() returns RDD pair of
      // entity ID and its aggregated properties
      .map { case (entityId, properties) =>
        try {
          LabeledPoint(properties.get[Double]("plan"),
            Vectors.dense(Array(
              properties.get[Double]("attr0"),
              properties.get[Double]("attr1"),
              properties.get[Double]("attr2")
            ))
          )
        } catch {
          case e: Exception => {
            logger.error(s"Failed to get properties ${properties} of" +
              s" ${entityId}. Exception: ${e}.")
            throw e
          }
        }
      }.cache()

    new TrainingData(labeledPoints)
  }
}

PEventStore is an object that provides functions to access data collected through the Event Server; PEventStore.aggregateProperties aggregates the event records of the four properties (attr0, attr1, attr2, and plan) for each user.
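
Each user's properties arrive as $set events collected by the Event Server; an illustrative event (the entity ID and values are placeholders) looks like:

{
  "event": "$set",
  "entityType": "user",
  "entityId": "u0",
  "properties": { "attr0": 0, "attr1": 1, "attr2": 0, "plan": 1 }
}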

PredictionIO automatically loads the parameters of datasource specified in MyClassification/engine.json, including appName, into dsp.

In engine.json:

{
  ...
  "datasource": {
    "params": {
      "appName": "MyApp1"
    }
  },
  ...
}

In this sample text data file, columns are delimited by a comma (,). The first column contains labels; the second column contains features.
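
For example, an illustrative row (not taken from the actual file, and assuming the features are space-separated within the second column) encoding label 1 with features 4, 3, and 8:

1,4 3 8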

The class definition of TrainingData is:

class TrainingData(
  val labeledPoints: RDD[LabeledPoint]
) extends Serializable

and PredictionIO passes the returned TrainingData object to Data Preparator.

Data Preparator

In MyClassification/src/main/scala/Preparator.scala, the prepare method of class Preparator takes TrainingData. It then conducts any necessary feature selection and data processing tasks. At the end, it returns PreparedData, which should contain the data Algorithm needs. For MLlib NaiveBayes, it is RDD[LabeledPoint].

By default, prepare simply copies the unprocessed TrainingData to PreparedData:

class PreparedData(
  val labeledPoints: RDD[LabeledPoint]
) extends Serializable

class Preparator
  extends PPreparator[TrainingData, PreparedData] {

  def prepare(sc: SparkContext, trainingData: TrainingData): PreparedData = {
    new PreparedData(trainingData.labeledPoints)
  }
}

PredictionIO passes the returned PreparedData object to Algorithm's train function.

Algorithm

In MyClassification/src/main/scala/NaiveBayesAlgorithm.scala, the two methods of the algorithm class are train and predict. train is responsible for training a predictive model; PredictionIO will store this model. predict is responsible for using the model to make predictions.

train(...)

train is called when you run pio train. This is where the MLlib NaiveBayes algorithm, i.e. NaiveBayes.train, is used to train a predictive model.

def train(sc: SparkContext, data: PreparedData): NaiveBayesModel = {
  NaiveBayes.train(data.labeledPoints, ap.lambda)
}

In addition to RDD[LabeledPoint] (i.e. data.labeledPoints), NaiveBayes.train takes one parameter: lambda (the additive smoothing parameter).

The value of this parameter is specified in the algorithms section of MyClassification/engine.json:

{
  ...
  "algorithms": [
    {
      "name": "naive",
      "params": {
        "lambda": 1.0
      }
    }
  ]
  ...
}

PredictionIO automatically loads these values into the constructor argument ap, which has a corresponding case class AlgorithmParams:

case class AlgorithmParams(
  lambda: Double
) extends Params

NaiveBayes.train then returns a NaiveBayesModel model. PredictionIO will automatically store the returned model.

predict(...)

The predict method is called when you send a JSON query to http://localhost:8000/queries.json. PredictionIO converts the query, such as { "attr0":4, "attr1":3, "attr2":8 } to the Query class you defined previously.

The predictive model NaiveBayesModel of MLlib NaiveBayes offers a function called predict. predict takes a dense vector of features. It predicts the label of the item represented by this feature vector.

  def predict(model: NaiveBayesModel, query: Query): PredictedResult = {
    val label = model.predict(Vectors.dense(
        query.attr0, query.attr1, query.attr2
    ))
    new PredictedResult(label)
  }

You have defined the class PredictedResult earlier in this page.

PredictionIO passes the returned PredictedResult object to Serving.

Serving

The serve method of class Serving processes the predicted result. It is also responsible for combining multiple predicted results into one if you have more than one predictive model. Serving then returns the final predicted result. PredictionIO will convert it to a JSON response automatically.

In MyClassification/src/main/scala/Serving.scala,

class Serving
  extends LServing[Query, PredictedResult] {

  override
  def serve(query: Query,
    predictedResults: Seq[PredictedResult]): PredictedResult = {
    predictedResults.head
  }
}

When you send a JSON query to http://localhost:8000/queries.json, PredictedResult from all models will be passed to serve as a sequence, i.e. Seq[PredictedResult].

An engine can train multiple models if you specify more than one Algorithm component in object ClassificationEngine inside Engine.scala. Since only one NaiveBayesAlgorithm is implemented by default, this Seq contains one element.

In this case, serve simply returns the predicted result of the first (and only) algorithm, i.e. predictedResults.head.
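
If you registered more than one algorithm, serve could combine their outputs instead; a hypothetical sketch (not part of the template) that picks the label most models agree on:

override
def serve(query: Query,
  predictedResults: Seq[PredictedResult]): PredictedResult = {
  // Hypothetical: return the label predicted by the most models.
  val label = predictedResults.groupBy(_.label).maxBy(_._2.size)._1
  new PredictedResult(label)
}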

Congratulations! You have just learned how to customize and build a production-ready engine. Have fun!
