Engine Development - Troubleshooting

Apache PredictionIO (incubating) provides the following features to help you debug engines during the development cycle.

Stop Training between Stages

By default, pio train runs through the whole training process, including DataSource, Preparator and Algorithm. To speed up the development and debugging cycle, you can stop the process after each stage to verify it has completed correctly.

If you have modified DataSource and want to confirm the TrainingData is generated as expected, you can run pio train with the --stop-after-read option:

pio train --stop-after-read

This would stop the training process after the TrainingData is generated.

For example, if you are running the Recommendation Template, you should see the training process stop after the TrainingData is printed.

[INFO] [CoreWorkflow$] TrainingData:
[INFO] [CoreWorkflow$] ratings: [1501] (List(Rating(3,0,4.0), Rating(3,1,4.0))...)
...
[INFO] [CoreWorkflow$] Training interrupted by org.apache.predictionio.workflow.StopAfterReadInterruption.

Similarly, you can stop the training after the Preparator phase by using the --stop-after-prepare option, which stops after PreparedData is generated:

pio train --stop-after-prepare

Sanity Check

You can extend the SanityCheck trait and implement the sanityCheck() method with your error-checking code. sanityCheck() is called when the data is generated. This can be applied to the TrainingData, PreparedData and Model classes, which are the outputs of DataSource's readTraining(), Preparator's prepare() and Algorithm's train() methods, respectively.

For example, one frequent error with the Recommendation Template is that the TrainingData is empty because the DataSource is not reading data correctly. You can add a check for empty data inside the sanityCheck() function, and you can easily add other checking logic based on your own needs. Also, if you implement the toString() method in your TrainingData, you can call toString() inside sanityCheck() to print out some data for visual inspection.

For example, to print TrainingData to the console and check whether ratings is empty, you can do the following:

import org.apache.predictionio.controller.SanityCheck // ADDED

class TrainingData(
  val ratings: RDD[Rating]
) extends Serializable with SanityCheck { // EXTEND SanityCheck
  override def toString = {
    s"ratings: [${ratings.count()}] (${ratings.take(2).toList}...)"
  }

  // IMPLEMENT sanityCheck()
  override def sanityCheck(): Unit = {
    println(toString())
    // add your other checking here
    require(!ratings.take(1).isEmpty, s"ratings cannot be empty!")
  }
}
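The same pattern applies to PreparedData and the Model class. Below is a minimal sketch for PreparedData, assuming it simply wraps the same ratings RDD (as the Recommendation Template does); adapt the check to whatever your Preparator actually produces.

import org.apache.predictionio.controller.SanityCheck
import org.apache.spark.rdd.RDD

class PreparedData(
  val ratings: RDD[Rating]
) extends Serializable with SanityCheck {
  // Called right after Preparator's prepare() returns
  override def sanityCheck(): Unit = {
    require(!ratings.take(1).isEmpty, "prepared ratings cannot be empty!")
  }
}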

You may also use this together with the --stop-after-read flag to debug the DataSource:

pio build
pio train --stop-after-read

If your data is empty, you should see the following error thrown by the sanityCheck() function:

[INFO] [CoreWorkflow$] Performing data sanity check on training data.
[INFO] [CoreWorkflow$] org.template.recommendation.TrainingData supports data sanity check. Performing check.
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: ratings cannot be empty!
    at scala.Predef$.require(Predef.scala:233)
    at org.template.recommendation.TrainingData.sanityCheck(DataSource.scala:73)
    at org.apache.predictionio.workflow.CoreWorkflow$$anonfun$runTypelessContext$7.apply(Workflow.scala:474)
    at org.apache.predictionio.workflow.CoreWorkflow$$anonfun$runTypelessContext$7.apply(Workflow.scala:465)
    at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
  ...

You can specify the --skip-sanity-check option to turn off sanityCheck:

pio train --stop-after-read --skip-sanity-check

You should see that the check is skipped, as in the following output:

[INFO] [CoreWorkflow$] Data sanity checking is off.
[INFO] [CoreWorkflow$] Data Source
...
[INFO] [CoreWorkflow$] Training interrupted by org.apache.predictionio.workflow.StopAfterReadInterruption.

Engine Status Page

After running pio deploy, you can access the engine status page by pointing your browser at the same URL and port as the deployed engine, which is "http://localhost:8000" by default. On the engine status page, you can find the engine information and the parameters of each DASE component. In particular, you can also see the "Model" trained by the algorithm, depending on how the toString() method is implemented in the Algorithm's Model class.
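For instance, here is a minimal sketch of a Model class overriding toString() so the status page shows a readable summary; the ALS-style feature RDDs are illustrative assumptions, not a required structure.

import org.apache.spark.rdd.RDD

class Model(
  val rank: Int,
  val userFeatures: RDD[(Int, Array[Double])],
  val productFeatures: RDD[(Int, Array[Double])]
) extends Serializable {
  // The engine status page renders this string for the trained model
  override def toString =
    s"userFeatures: [${userFeatures.count()}] productFeatures: [${productFeatures.count()}]"
}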

pio-shell

Apache PredictionIO (incubating) also provides pio-shell, in which you can easily access the Apache PredictionIO (incubating) API, the Spark context and the Spark API for quick code testing or debugging.

To bring up the shell, simply run:

$ pio-shell --with-spark

(pio-shell is available inside the bin/ directory of your Apache PredictionIO (incubating) installation; you should be able to access it if you have added PredictionIO/bin to your PATH environment variable.)

Note that the Spark context is available as variable sc inside the shell.

For example, to get the events of MyApp1 using the PEventStore API inside pio-shell and collect them into an array c, run the following in the shell:

> import org.apache.predictionio.data.store.PEventStore
> val eventsRDD = PEventStore.find(appName="MyApp1")(sc)
> val c = eventsRDD.collect()

Then you should see the following returned in the shell:

...
15/05/18 14:24:42 INFO DAGScheduler: Job 0 finished: collect at <console>:24, took 1.850779 s
c: Array[org.apache.predictionio.data.storage.Event] = Array(Event(id=Some(AaQUUBsFZxteRpDV_7fDGQAAAU1ZfRW1tX9LSWdZSb0),event=$set,eType=item,eId=i42,tType=None,tId=None,p=DataMap(Map(categories -> JArray(List(JString(c2), JString(c1), JString(c6), JString(c3))))),t=2015-05-15T21:31:19.349Z,tags=List(),pKey=None,ct=2015-05-15T21:31:19.354Z), Event(id=Some(DjvP3Dnci9F4CWmiqoLabQAAAU1ZfROaqdRYO-pZ_no),event=$set,eType=user,eId=u9,tType=None,tId=None,p=DataMap(Map()),t=2015-05-15T21:31:18.810Z,tags=List(),pKey=None,ct=2015-05-15T21:31:18.817Z), Event(id=Some(DjvP3Dnci9F4CWmiqoLabQAAAU1ZfRq7tsanlemwmZQ),event=view,eType=user,eId=u9,tType=Some(item),tId=Some(i25),p=DataMap(Map()),t=2015-05-15T21:31:20.635Z,tags=List(),pKey=None,ct=2015-05-15T21:31:20.639Z), Event(id=Some(DjvP3Dnci9F4CWmiqoLabQAAAU1ZfR...
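Since eventsRDD is a regular Spark RDD, you can keep transforming it in the shell before collecting; for example, a quick (illustrative) count of "view" events:

> val viewCount = eventsRDD.filter(_.event == "view").count()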
Machine Learning Analytics with IPython Notebook

IPython Notebook is a very powerful interactive computational environment, and with Apache PredictionIO (incubating), PySpark and Spark SQL, you can easily analyze your collected events when you are developing or tuning your engine.

Prerequisites

Before you begin, please make sure you have the latest stable IPython installed, and that the command ipython can be accessed from your shell's search path.

Export Events to Apache Parquet

PredictionIO supports exporting your events to Apache Parquet, a columnar storage format that allows you to query the data quickly.

Let's export the data we imported in Recommendation Engine Template Quick Start, and assume the App ID is 1.

$ $PIO_HOME/bin/pio export --appid 1 --output /tmp/movies --format parquet

After the command has finished successfully, you should see something similar to the following.

root
 |-- creationTime: string (nullable = true)
 |-- entityId: string (nullable = true)
 |-- entityType: string (nullable = true)
 |-- event: string (nullable = true)
 |-- eventId: string (nullable = true)
 |-- eventTime: string (nullable = true)
 |-- properties: struct (nullable = true)
 |    |-- rating: double (nullable = true)
 |-- targetEntityId: string (nullable = true)
 |-- targetEntityType: string (nullable = true)

Preparing IPython Notebook

Launch IPython Notebook with PySpark using the following command, with $SPARK_HOME replaced by the location of Apache Spark.

$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --pylab inline" $SPARK_HOME/bin/pyspark

By default, you should be able to access your IPython Notebook via web browser at http://localhost:8888.

Let's initialize our notebook by running the following code in the first cell.

import pandas as pd
def rows_to_df(rows):
    return pd.DataFrame(map(lambda e: e.asDict(), rows))
from pyspark.sql import SQLContext
sqlc = SQLContext(sc)
rdd = sqlc.parquetFile("/tmp/movies")
rdd.registerTempTable("events")

Initialization for IPython Notebook

rows_to_df(rows) will come in handy when we want to dump the results from Spark SQL using IPython Notebook's native table rendering.

Performing Analysis with Spark SQL

If all steps above ran successfully, you should have a ready-to-use analytics environment by now. Let's try a few examples to see if everything is functional.

In the second cell, put in this piece of code and run it.

summary = sqlc.sql("SELECT "
                   "entityType, event, targetEntityType, COUNT(*) AS c "
                   "FROM events "
                   "GROUP BY entityType, event, targetEntityType").collect()
rows_to_df(summary)

You should see the following screen.

Summary of Events

We can also plot our data in the next two cells.

import matplotlib.pyplot as plt
count = map(lambda e: e.c, summary)
event = map(lambda e: "%s (%d)" % (e.event, e.c), summary)
colors = ['gold', 'lightskyblue']
plt.pie(count, labels=event, colors=colors, startangle=90, autopct="%1.1f%%")
plt.axis('equal')
plt.show()

Summary in Pie Chart

ratings = sqlc.sql("SELECT properties.rating AS r, COUNT(*) AS c "
                   "FROM events "
                   "WHERE properties.rating IS NOT NULL "
                   "GROUP BY properties.rating "
                   "ORDER BY r").collect()
count = map(lambda e: e.c, ratings)
rating = map(lambda e: "%s (%d)" % (e.r, e.c), ratings)
colors = ['yellowgreen', 'plum', 'gold', 'lightskyblue', 'lightcoral']
plt.pie(count, labels=rating, colors=colors, startangle=90,
        autopct="%1.1f%%")
plt.axis('equal')
plt.show()

Breakdown of Ratings

Happy analyzing!

Machine Learning Analytics with Tableau

With Spark SQL, it is possible to connect Tableau to Apache PredictionIO (incubating) Event Server for interactive analysis of event data.

Prerequisites

In this article, we will assume that you have a working HDFS, and that your environment variable HADOOP_HOME has been properly set. This is essential for Apache Hive to function properly. In addition, HADOOP_CONF_DIR in $PIO_HOME/conf/pio-env.sh must also be properly set for the pio export command to write to HDFS instead of the local filesystem.

Export Events to Apache Parquet

PredictionIO supports exporting your events to Apache Parquet, a columnar storage format that allows you to query the data quickly.

Let's export the data we imported in Recommendation Engine Template Quick Start, and assume the App ID is 1.

$ $PIO_HOME/bin/pio export --appid 1 --output /tmp/movies --format parquet

After the command has finished successfully, you should see something similar to the following.

root
 |-- creationTime: string (nullable = true)
 |-- entityId: string (nullable = true)
 |-- entityType: string (nullable = true)
 |-- event: string (nullable = true)
 |-- eventId: string (nullable = true)
 |-- eventTime: string (nullable = true)
 |-- properties: struct (nullable = true)
 |    |-- rating: double (nullable = true)
 |-- targetEntityId: string (nullable = true)
 |-- targetEntityType: string (nullable = true)

Creating Hive Tables

Before you can use Spark SQL's Thrift JDBC/ODBC Server, you will need to create the table schema in Hive first. Please make sure to replace path_of_hive with the real path.

$ cd path_of_hive
$ bin/hive
hive> CREATE EXTERNAL TABLE events (event STRING, entityType STRING, entityId STRING, targetEntityType STRING, targetEntityId STRING, properties STRUCT<rating:DOUBLE>) STORED AS parquet LOCATION '/tmp/movies';
hive> exit;

Launch Spark SQL's Thrift JDBC/ODBC Server

Once you have created your Hive tables, create a Hive configuration in your Spark installation. If you have a custom hive-site.xml, simply copy or link it to $SPARK_HOME/conf. Otherwise, Hive will have created a local Derby database, and you will need to let Spark know about it. Create $SPARK_HOME/conf/hive-site.xml from scratch with the following template.

You must change /opt/apache-hive-0.13.1-bin below to a real Hive path.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/opt/apache-hive-0.13.1-bin/metastore_db;create=true</value>
  </property>
</configuration>

Launch Spark SQL's Thrift JDBC/ODBC Server by running:

$ $SPARK_HOME/sbin/start-thriftserver.sh

You can test the server using the included Beeline client.

$ $SPARK_HOME/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
(Use empty username and password when prompted)
0: jdbc:hive2://localhost:10000> select * from events limit 10;
+--------+-------------+-----------+-------------------+-----------------+------------------+
| event  | entitytype  | entityid  | targetentitytype  | targetentityid  |    properties    |
+--------+-------------+-----------+-------------------+-----------------+------------------+
| buy    | user        | 3         | item              | 0               | {"rating":null}  |
| buy    | user        | 3         | item              | 1               | {"rating":null}  |
| rate   | user        | 3         | item              | 2               | {"rating":1.0}   |
| buy    | user        | 3         | item              | 7               | {"rating":null}  |
| buy    | user        | 3         | item              | 8               | {"rating":null}  |
| buy    | user        | 3         | item              | 9               | {"rating":null}  |
| rate   | user        | 3         | item              | 14              | {"rating":1.0}   |
| buy    | user        | 3         | item              | 15              | {"rating":null}  |
| buy    | user        | 3         | item              | 16              | {"rating":null}  |
| buy    | user        | 3         | item              | 18              | {"rating":null}  |
+--------+-------------+-----------+-------------------+-----------------+------------------+
10 rows selected (0.515 seconds)
0: jdbc:hive2://localhost:10000>

Now you are ready to use Tableau!

Performing Analysis with Tableau

Launch Tableau and Connect to Data. Click on Spark SQL (Beta) and enter Spark SQL's Thrift JDBC/ODBC Server information. Make sure to pick User Name as Authentication. Click Connect.

Tableau and Spark SQL

On the next page, pick default under Schema.

You may not see any choices when you click on Schema. Simply press Enter and Tableau will try to list all schemas.

Once you see a list of tables that includes events, click New Custom SQL, then enter the following.

SELECT event, entityType, entityId, targetEntityType, targetEntityId, properties.rating FROM events

Click Update Now. You should now see the following screen, indicating that the data loaded successfully. Using custom SQL allows you to extract arbitrary fields from within properties.

Setting up Tableau

Click Go to Worksheet and start analyzing. The following shows an example of breaking down different rating values.

Rating Values Breakdown

The following shows a summary of interactions.

Interactions

Happy analyzing!
