Subject: [30/51] [partial] predictionio-site git commit: Documentation based on apache/predictionio#875b98020e02dbf2441e867429a57c00f52375d5
From: git-site-role@apache.org
To: commits@predictionio.incubator.apache.org
Date: Thu, 01 Mar 2018 23:00:38 -0000

http://git-wip-us.apache.org/repos/asf/predictionio-site/blob/1edaec49/demo/supervisedlearning/index.html

Machine Learning With PredictionIO

This guide is designed to give developers a brief introduction to fundamental concepts in machine learning, as well as an explanation of how these concepts tie into PredictionIO's engine development platform. This particular guide will largely deal with giving some

Introduction to Supervised Learning

The first question we must ask is: what is machine learning? Machine learning is the field of study at the intersection of computer science, engineering, mathematics, and statistics which seeks to discover or infer patterns hidden within a set of observations, which we call our data. Some examples of problems that machine learning seeks to solve are:

  • Predict whether a patient has breast cancer based on their mammogram results.
  • Predict whether an e-mail is spam or not based on the e-mail's content.
  • Predict today's temperature based on climate variables collected for the previous week.

Thinking About Data

In the examples above, we are trying to predict an outcome \(Y\), or response, based on some recorded or observed variables \(X\), or features. For example, in the first problem each observation is a patient, the response variable \(Y\) is equal to 1 if the patient has breast cancer and 0 otherwise, and \(X\) represents the mammogram results.

When we say we want to predict \(Y\) using \(X\), we are trying to answer the question: how does the response \(Y\) depend on a set of features \(X\)? To do this we need a set of observations, which we call our training data, consisting of observations for which we have observed both \(Y\) and \(X\), in order to make inferences about this relationship.

Different Types of Supervised Learning Problems

Note that in the first two examples, the outcome \(Y\) can only take on two values (1: cancer/spam, 0: no cancer/no spam). Whenever the outcome variable \(Y\) denotes a label associated with a particular group of observations (e.g. the cancer group), the supervised learning problem is also called a classification problem. In the third example, however, \(Y\) can take on any numerical value since it denotes a temperature reading (e.g. 25.143, 25.14233, 32.0). These types of supervised learning problems are also called regression problems.

Training a Predictive Model

A predictive model should be thought of as a function \(f\) that takes as input a set of features, and outputs a predicted outcome (i.e. \(f(X) = Y\)). The phrase training a model simply refers to the process of using the training data to estimate such a function.
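To make the idea of estimating \(f\) from training data concrete, here is a minimal sketch for a regression problem with a single numeric feature. It fits a straight line by ordinary least squares and returns the fitted function. The names `SimpleRegression` and `train` are hypothetical and chosen for illustration; they are not part of PredictionIO.

```scala
// A minimal sketch: "training" estimates a function f from (x, y) pairs.
// SimpleRegression and train are hypothetical names, not PredictionIO API.
object SimpleRegression {
  // Fit y = intercept + slope * x by ordinary least squares and return f.
  def train(data: Seq[(Double, Double)]): Double => Double = {
    require(data.size >= 2, "need at least two observations")
    val n = data.size.toDouble
    val meanX = data.map(_._1).sum / n
    val meanY = data.map(_._2).sum / n
    // Sum of squared deviations of x, and of cross deviations of x and y.
    val sxx = data.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    require(sxx > 0, "x values must not all be equal")
    val sxy = data.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val slope = sxy / sxx
    val intercept = meanY - slope * meanX
    // The trained model is simply a function from a feature to a prediction.
    x => intercept + slope * x
  }
}
```

Calling `SimpleRegression.train` on a handful of (feature, response) pairs yields a function that can then be applied to a new feature value to obtain a predicted response.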

PredictionIO and Supervised Learning

Machine learning methods generally assume that our observation responses and features are numeric vectors. We will say that observations in this format are in standard form. However, when you are working with real-life data this will often not be the case. The data will often be formatted in a manner that is specific to the application's needs. As an example, let's suppose our application is Stack Overflow. The data we want to analyze are questions, and we want to predict based on a question's content whether or not it is related to Scala.

Self-check: Is this a classification or regression problem?

Thinking About Data With PredictionIO

PredictionIO's predictive engine development platform allows you to easily incorporate observations that are not in standard form. Continuing with our example, we can import the observations, or Stack Overflow questions, into PredictionIO's Event Server as events with the following properties:

properties = {question : String, topic : String}

The value question is the actual question stored as a String, and topic is also a string equal to either "Scala" or "Other". Our outcome here is topic, and question will provide a source for extracting features. That is, we will be using question to predict the outcome topic.

Once the observations are loaded as events into the Event Server, the engine's Data Source component is able to read them, which allows you to treat them as objects in a Scala project. The engine's Preparator component is in charge of converting these observations into standard form. To do this, we can first map the topic values as follows:

Map("Other" -> 0, "Scala" -> 1).

We can then vectorize the observation's associated question text to obtain a numeric feature vector for each of our observations. This text vectorization procedure is an example of a general concept in machine learning called feature extraction. After performing these transformations, our observations are in standard form and can be used for training a wide variety of machine learning models.
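As an illustration of what converting to standard form might look like, the sketch below encodes the topic with the mapping given above and vectorizes the question text with a simple bag-of-words count. Bag-of-words is only one possible vectorization, assumed here for simplicity, and the names `Prep`, `tokenize`, `buildVocabulary`, and `vectorize` are hypothetical helpers in plain Scala rather than PredictionIO's Preparator API.

```scala
// A minimal sketch of label encoding and text vectorization.
// All names here are hypothetical; this is plain Scala, not PredictionIO API.
object Prep {
  // Label encoding taken from the mapping in the text.
  val topicToLabel: Map[String, Int] = Map("Other" -> 0, "Scala" -> 1)

  // Lowercase the text and split it on runs of non-alphanumeric characters.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  // Give each distinct token in the training questions a fixed index.
  def buildVocabulary(questions: Seq[String]): Map[String, Int] =
    questions.flatMap(tokenize).distinct.sorted.zipWithIndex.toMap

  // Count how often each vocabulary token occurs in the question.
  // Tokens that are not in the vocabulary are ignored.
  def vectorize(vocab: Map[String, Int], question: String): Array[Double] = {
    val vec = Array.fill(vocab.size)(0.0)
    tokenize(question).foreach { t =>
      vocab.get(t).foreach(i => vec(i) += 1.0)
    }
    vec
  }
}
```

Each question then becomes a numeric vector of fixed length (the vocabulary size), paired with a numeric label of 0 or 1.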

Training the Model With PredictionIO

The Algorithm engine component serves two purposes: outputting a predictive model \(f\) and using this model to predict the outcome variable. Here \(f\) takes as input a vectorized question and outputs either 0 or 1. However, our Query input will again be a question, and our PredictedResult will be the topic associated with the predicted label (0 or 1):

Query = {question : String}
PredictedResult = {topic : String}

With PredictionIO's engine development platform, you can easily automate the vectorization of the Query question, as well as mapping the predicted label to the appropriate topic output format.
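The flow from a Query to a PredictedResult can be sketched in plain Scala as follows. The case classes mirror the Query and PredictedResult shapes given above, while `Serving`, `labelToTopic`, and `predict` are hypothetical names used for illustration. The vectorizer and the model \(f\) are passed in as functions, so this sketch makes no assumption about how either is produced, and it does not use PredictionIO's actual interfaces.

```scala
// A minimal sketch of the prediction path: question -> vector -> label -> topic.
// Serving, labelToTopic and predict are hypothetical, not PredictionIO API.
object Serving {
  case class Query(question: String)
  case class PredictedResult(topic: String)

  // Inverse of the label encoding Map("Other" -> 0, "Scala" -> 1).
  val labelToTopic: Map[Int, String] = Map(0 -> "Other", 1 -> "Scala")

  // Vectorize the query text, apply the model f, and map the label to a topic.
  def predict(
      vectorize: String => Array[Double],
      f: Array[Double] => Int
  )(query: Query): PredictedResult =
    PredictedResult(labelToTopic(f(vectorize(query.question))))
}
```

For example, with a toy vectorizer that counts occurrences of the word "scala" and a toy model that outputs 1 whenever that count is positive, `Serving.predict(toyVectorize, toyModel)(Serving.Query("What is a Scala trait?"))` evaluates to `PredictedResult("Scala")`.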
