System Architecture and Dependencies

During the installation, you have installed the following software:

  • Apache Hadoop 2.4.0 (required only if YARN and HDFS are needed)
  • Apache HBase 0.98.6
  • Apache Spark 1.2.0 for Hadoop 2.4
  • Elasticsearch 1.4.0

This section explains how they are used in PredictionIO.

PredictionIO Systems

HBase: Event Server uses Apache HBase as the data store. It stores imported events. If you are not using the PredictionIO Event Server, you do not need to install HBase.
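
For illustration, here is how one event might be imported into a running Event Server (a hedged sketch: 7070 is the default Event Server port, while the access key, entity ID, and property values are placeholders for your own setup):

$ curl -i -X POST "http://localhost:7070/events.json?accessKey=YOUR_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "event": "$set",
    "entityType": "user",
    "entityId": "u0",
    "properties": { "attr0": 4, "attr1": 3, "attr2": 8, "plan": 1 }
  }'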

Apache Spark: Spark is a large-scale data processing engine that powers the algorithm, training, and serving components.

A Spark algorithm differs from a conventional single-machine algorithm in that it uses Spark's RDD abstraction as its primary data type. The PredictionIO framework natively supports both RDD-based algorithms and traditional single-machine algorithms.
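
For illustration, a minimal standalone Spark sketch (not from the PredictionIO codebase) of the RDD abstraction that RDD-based algorithms build on:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Sketch: an RDD is a distributed collection; transformations and
// aggregations on it run across the Spark cluster, not on one machine.
object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))
    val data: RDD[Double] = sc.parallelize(1 to 1000000).map(_.toDouble)
    val mean = data.sum() / data.count() // computed in parallel
    println(s"mean = $mean")
    sc.stop()
  }
}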

HDFS: The output of training has two parts: a model and its metadata. The model is then stored in HDFS or a local file system.

Elasticsearch: It stores metadata such as model versions, engine versions, access key and app id mappings, evaluation results, etc.

Using Alternative Algorithm

The classification template uses the Naive Bayes algorithm by default. You can easily add and use other MLlib classification algorithms. The following demonstrates how to add the MLlib Random Forests algorithm to the engine.

Create a new file RandomForestAlgorithm.scala

Locate src/main/scala/NaiveBayesAlgorithm.scala under your engine directory, which should be /MyClassification if you are following the Classification Quick Start. Copy NaiveBayesAlgorithm.scala to a new file RandomForestAlgorithm.scala. You will modify this file, following the instructions below, to define a new RandomForestAlgorithm class.
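
For example (assuming your shell is at the engine directory):

$ cd MyClassification
$ cp src/main/scala/NaiveBayesAlgorithm.scala src/main/scala/RandomForestAlgorithm.scala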

Define the algorithm class and parameters

In RandomForestAlgorithm.scala, import the MLlib Random Forest classes by changing the following lines:

Original

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.classification.NaiveBayesModel

Change to:

import org.apache.spark.mllib.tree.RandomForest // CHANGED
import org.apache.spark.mllib.tree.model.RandomForestModel // CHANGED

These classes are necessary to use MLlib's Random Forest algorithm.

Modify the AlgorithmParams class for the Random Forest algorithm:

// CHANGED
case class RandomForestAlgorithmParams(
  numClasses: Int,
  numTrees: Int,
  featureSubsetStrategy: String,
  impurity: String,
  maxDepth: Int,
  maxBins: Int
) extends Params

This class defines the parameters of the Random Forest algorithm (whose values you can later specify in engine.json). Please refer to the MLlib documentation for the description and usage of these parameters.

Modify the NaiveBayesAlgorithm class to RandomForestAlgorithm. The changes are:

  • The new RandomForestAlgorithmParams class is used as the parameter class.
  • RandomForestModel is used as the type parameter. This is the model returned by the Random Forest algorithm.
  • The train() function is modified to return a RandomForestModel instead of a NaiveBayesModel.
  • The predict() function takes a RandomForestModel as input.

// extends P2LAlgorithm because MLlib's RandomForestModel doesn't
// contain RDD.
class RandomForestAlgorithm(val ap: RandomForestAlgorithmParams) // CHANGED
  extends P2LAlgorithm[PreparedData, RandomForestModel, // CHANGED
  Query, PredictedResult] {

  // CHANGED
  def train(sc: SparkContext, data: PreparedData): RandomForestModel = {
    // CHANGED
    // Empty categoricalFeaturesInfo indicates all features are continuous.
    val categoricalFeaturesInfo = Map[Int, Int]()
    RandomForest.trainClassifier(
      data.labeledPoints,
      ap.numClasses,
      categoricalFeaturesInfo,
      ap.numTrees,
      ap.featureSubsetStrategy,
      ap.impurity,
      ap.maxDepth,
      ap.maxBins)
  }

  def predict(
    model: RandomForestModel, // CHANGED
    query: Query): PredictedResult = {

    val label = model.predict(Vectors.dense(
        query.attr0, query.attr1, query.attr2
    ))
    new PredictedResult(label)
  }

}

Note that the MLlib Random Forest algorithm takes the same training data as the Naive Bayes algorithm (i.e., RDD[LabeledPoint]), so you don't need to modify the DataSource, TrainingData, and PreparedData classes. If the new algorithm to be added requires a different type of training data, then you need to modify these classes accordingly to accommodate your new algorithm, as sketched below.
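
For example (a purely hypothetical sketch with an invented field name), an algorithm that consumed raw attribute rows instead of RDD[LabeledPoint] would need a PreparedData along these lines, with DataSource and Preparator adapted to match:

// Hypothetical sketch only: rows of (label, raw feature array)
// instead of RDD[LabeledPoint].
class PreparedData(
  val rows: RDD[(Double, Array[Double])]
) extends Serializable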

Update Engine.scala

Modify the Engine Factory to add the new algorithm class RandomForestAlgorithm you just defined, and give it the name "randomforest". This name will be used in engine.json to specify which algorithm to use.

object ClassificationEngine extends IEngineFactory {
  def apply() = {
    new Engine(
      classOf[DataSource],
      classOf[Preparator],
      Map("naive" -> classOf[NaiveBayesAlgorithm],
        "randomforest" -> classOf[RandomForestAlgorithm]), // ADDED
      classOf[Serving])
  }
}

This engine factory now returns an engine with two algorithms, named "naive" and "randomforest" respectively.

Update engine.json

In order to use the new algorithm, you need to modify engine.json to specify the name of the algorithm and the parameters.

Update the engine.json to use randomforest:

...
"algorithms": [
  {
    "name": "randomforest",
    "params": {
      "numClasses": 3,
      "numTrees": 5,
      "featureSubsetStrategy": "auto",
      "impurity": "gini",
      "maxDepth": 4,
      "maxBins": 100
    }
  }
]
...

The engine now uses the MLlib Random Forests algorithm instead of the default Naive Bayes algorithm. You are ready to build, train, and deploy the engine as described in the Quick Start.

$ pio build
$ pio train
$ pio deploy

To switch back to the Naive Bayes algorithm, simply modify engine.json.
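
For example, restore the algorithms section of engine.json to the Naive Bayes entry (the same form shown on the DASE page):

"algorithms": [
  {
    "name": "naive",
    "params": {
      "lambda": 1.0
    }
  }
]

Then run pio build, pio train, and pio deploy again to apply the change.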

DASE Components Explained (Classification)

PredictionIO's DASE architecture brings the separation-of-concerns design principle to predictive engine development. DASE stands for the following components of an engine:

  • Data - includes Data Source and Data Preparator
  • Algorithm(s)
  • Serving
  • Evaluator

Let's look at the code and see how you can customize the engine you built from the Classification Engine Template.

Evaluator will not be covered in this tutorial. Please see Evaluation Explained for details on using evaluation.

The Engine Design

As you can see from the Quick Start, MyClassification takes a JSON prediction query, e.g. { "attr0":4, "attr1":3, "attr2":8 }, and returns a JSON predicted result.

For versions before v0.3.1, the query is an array of feature values: { "features": [4, 3, 8] }.

In MyClassification/src/main/scala/Engine.scala, the Query case class defines the format of the query, such as { "attr0":4, "attr1":3, "attr2":8 }:

class Query(
  val attr0 : Double,
  val attr1 : Double,
  val attr2 : Double
) extends Serializable

The PredictedResult case class defines the format of the predicted result, such as {"label":2.0}:

class PredictedResult(
  val label: Double
) extends Serializable

Finally, ClassificationEngine is the Engine Factory that defines the components this engine will use: Data Source, Data Preparator, Algorithm(s) and Serving components.

object ClassificationEngine extends IEngineFactory {
  def apply() = {
    new Engine(
      classOf[DataSource],
      classOf[Preparator],
      Map("naive" -> classOf[NaiveBayesAlgorithm]),
      classOf[Serving])
  }
}

Spark MLlib

Spark MLlib's NaiveBayes algorithm takes training data of RDD type, i.e. RDD[LabeledPoint], and trains a model, which is a NaiveBayesModel object.
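
Outside of the engine template, the same MLlib call can be exercised directly — a minimal, self-contained sketch (Spark 1.x API; the toy data is illustrative only):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Train MLlib NaiveBayes directly on a toy RDD[LabeledPoint].
def trainToyModel(sc: SparkContext): NaiveBayesModel = {
  val data: RDD[LabeledPoint] = sc.parallelize(Seq(
    LabeledPoint(0.0, Vectors.dense(4.0, 3.0, 8.0)),
    LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 1.0))
  ))
  NaiveBayes.train(data, lambda = 1.0)
}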

PredictionIO's MLlib Classification engine template, on which MyClassification is based, integrates this algorithm under the DASE architecture. We will take a closer look at the DASE code below.

See the MLlib documentation to learn more about the NaiveBayes algorithm.

Data

In the DASE architecture, data is prepared by two components sequentially: Data Source and Data Preparator. Together they take data from the data store and prepare RDD[LabeledPoint] for the NaiveBayes algorithm.

Data Source

In MyClassification/src/main/scala/DataSource.scala, the readTraining method of class DataSource reads and selects data from the Event Server's data store and returns TrainingData.

case class DataSourceParams(appName: String) extends Params

class DataSource(val dsp: DataSourceParams)
  extends PDataSource[TrainingData, EmptyEvaluationInfo, Query, EmptyActualResult] {

  @transient lazy val logger = Logger[this.type]

  override
  def readTraining(sc: SparkContext): TrainingData = {

    val labeledPoints: RDD[LabeledPoint] = PEventStore.aggregateProperties(
      appName = dsp.appName,
      entityType = "user",
      // only keep entities with these required properties defined
      required = Some(List("plan", "attr0", "attr1", "attr2")))(sc)
      // aggregateProperties() returns RDD pair of
      // entity ID and its aggregated properties
      .map { case (entityId, properties) =>
        try {
          LabeledPoint(properties.get[Double]("plan"),
            Vectors.dense(Array(
              properties.get[Double]("attr0"),
              properties.get[Double]("attr1"),
              properties.get[Double]("attr2")
            ))
          )
        } catch {
          case e: Exception => {
            logger.error(s"Failed to get properties ${properties} of" +
              s" ${entityId}. Exception: ${e}.")
            throw e
          }
        }
      }.cache()

    new TrainingData(labeledPoints)
  }
}

PEventStore is an object that provides functions to access data collected through the Event Server; PEventStore.aggregateProperties aggregates the event records of the four properties (attr0, attr1, attr2, and plan) for each user.
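
Each user's properties arrive as $set events collected by the Event Server; an illustrative event (the entity ID and values are placeholders) looks like:

{
  "event": "$set",
  "entityType": "user",
  "entityId": "u0",
  "properties": { "attr0": 0, "attr1": 1, "attr2": 0, "plan": 1 }
}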

PredictionIO automatically loads the parameters of datasource specified in MyClassification/engine.json, including appName, into dsp.

In engine.json:

{
  ...
  "datasource": {
    "params": {
      "appName": "MyApp1"
    }
  },
  ...
}

In this sample text data file, columns are delimited by a comma (,). The first column contains labels; the second column contains features.
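
For example, an illustrative row (not taken from the actual file, and assuming the features are space-separated within the second column) encoding label 1 with features 4, 3, and 8:

1,4 3 8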

The class definition of TrainingData is:

class TrainingData(
  val labeledPoints: RDD[LabeledPoint]
) extends Serializable

and PredictionIO passes the returned TrainingData object to Data Preparator.

Data Preparator

In MyClassification/src/main/scala/Preparator.scala, the prepare method of class Preparator takes TrainingData. It then conducts any necessary feature selection and data processing tasks. At the end, it returns PreparedData, which should contain the data Algorithm needs. For MLlib NaiveBayes, it is RDD[LabeledPoint].

By default, prepare simply copies the unprocessed TrainingData to PreparedData:

class PreparedData(
  val labeledPoints: RDD[LabeledPoint]
) extends Serializable

class Preparator
  extends PPreparator[TrainingData, PreparedData] {

  def prepare(sc: SparkContext, trainingData: TrainingData): PreparedData = {
    new PreparedData(trainingData.labeledPoints)
  }
}

PredictionIO passes the returned PreparedData object to Algorithm's train function.

Algorithm

In MyClassification/src/main/scala/NaiveBayesAlgorithm.scala, the two methods of the algorithm class are train and predict. train is responsible for training a predictive model; PredictionIO will store this model. predict is responsible for using the model to make predictions.

train(...)

train is called when you run pio train. This is where the MLlib NaiveBayes algorithm, i.e. NaiveBayes.train, is used to train a predictive model.

def train(sc: SparkContext, data: PreparedData): NaiveBayesModel = {
  NaiveBayes.train(data.labeledPoints, ap.lambda)
}

In addition to RDD[LabeledPoint] (i.e. data.labeledPoints), NaiveBayes.train takes one parameter: lambda (the additive smoothing parameter).

The value of this parameter is specified in the algorithms section of MyClassification/engine.json:

{
  ...
  "algorithms": [
    {
      "name": "naive",
      "params": {
        "lambda": 1.0
      }
    }
  ]
  ...
}

PredictionIO automatically loads these values into the constructor argument ap, which has a corresponding case class AlgorithmParams:

case class AlgorithmParams(
  lambda: Double
) extends Params

NaiveBayes.train then returns a NaiveBayesModel model. PredictionIO will automatically store the returned model.

predict(...)

The predict method is called when you send a JSON query to http://localhost:8000/queries.json. PredictionIO converts the query, such as { "attr0":4, "attr1":3, "attr2":8 } to the Query class you defined previously.

The predictive model NaiveBayesModel of MLlib NaiveBayes offers a function called predict. predict takes a dense vector of features. It predicts the label of the item represented by this feature vector.

  def predict(model: NaiveBayesModel, query: Query): PredictedResult = {
    val label = model.predict(Vectors.dense(
        query.attr0, query.attr1, query.attr2
    ))
    new PredictedResult(label)
  }

You have defined the class PredictedResult earlier in this page.

PredictionIO passes the returned PredictedResult object to Serving.

Serving

The serve method of class Serving processes the predicted result. It is also responsible for combining multiple predicted results into one if you have more than one predictive model. Serving then returns the final predicted result. PredictionIO will convert it to a JSON response automatically.

In MyClassification/src/main/scala/Serving.scala,

class Serving
  extends LServing[Query, PredictedResult] {

  override
  def serve(query: Query,
    predictedResults: Seq[PredictedResult]): PredictedResult = {
    predictedResults.head
  }
}

When you send a JSON query to http://localhost:8000/queries.json, PredictedResult from all models will be passed to serve as a sequence, i.e. Seq[PredictedResult].

An engine can train multiple models if you specify more than one Algorithm component in object ClassificationEngine inside Engine.scala. Since only one NaiveBayesAlgorithm is implemented by default, this Seq contains one element.

In this case, serve simply returns the predicted result of the first (and only) algorithm, i.e. predictedResults.head.
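
If you registered more than one algorithm, serve could combine their outputs instead; a hypothetical sketch (not part of the template) that picks the label most models agree on:

override
def serve(query: Query,
  predictedResults: Seq[PredictedResult]): PredictedResult = {
  // Hypothetical: return the label predicted by the most models.
  val label = predictedResults.groupBy(_.label).maxBy(_._2.size)._1
  new PredictedResult(label)
}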

Congratulations! You have just learned how to customize and build a production-ready engine. Have fun!
