Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 51957200BA9 for ; Sun, 9 Oct 2016 01:42:20 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 5037C160AF4; Sat, 8 Oct 2016 23:42:20 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id C72F4160ADF for ; Sun, 9 Oct 2016 01:42:17 +0200 (CEST) Received: (qmail 78182 invoked by uid 500); 8 Oct 2016 23:42:17 -0000 Mailing-List: contact commits-help@predictionio.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@predictionio.incubator.apache.org Delivered-To: mailing list commits@predictionio.incubator.apache.org Received: (qmail 78173 invoked by uid 99); 8 Oct 2016 23:42:16 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 08 Oct 2016 23:42:16 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 625CE1809D9 for ; Sat, 8 Oct 2016 23:42:16 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -3.52 X-Spam-Level: X-Spam-Status: No, score=-3.52 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, KAM_LAZY_DOMAIN_SECURITY=1, MANY_SPAN_IN_TEXT=2.699, RCVD_IN_DNSWL_HI=-5, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RP_MATCHES_RCVD=-2.999] autolearn=disabled Received: from mx2-lw-us.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id WQGQRr6i6kWd for ; Sat, 8 Oct 2016 23:42:01 +0000 (UTC) Received: from mail.apache.org 
(hermes.apache.org [140.211.11.3]) by mx2-lw-us.apache.org (ASF Mail Server at mx2-lw-us.apache.org) with SMTP id 8E38F5FC31 for ; Sat, 8 Oct 2016 23:42:00 +0000 (UTC) Received: (qmail 76991 invoked by uid 99); 8 Oct 2016 23:41:59 -0000 Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 08 Oct 2016 23:41:59 +0000 Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id 80D62E09D0; Sat, 8 Oct 2016 23:41:59 +0000 (UTC) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit From: donald@apache.org To: commits@predictionio.incubator.apache.org Date: Sat, 08 Oct 2016 23:42:25 -0000 Message-Id: In-Reply-To: <20e18e20b6ed44d893c7cc9113a2486f@git.apache.org> References: <20e18e20b6ed44d893c7cc9113a2486f@git.apache.org> X-Mailer: ASF-Git Admin Mailer Subject: [28/51] [abbrv] [partial] incubator-predictionio-site git commit: Documentation based on apache/incubator-predictionio#df568b6d505812928b59a662408d90119d524173 archived-at: Sat, 08 Oct 2016 23:42:20 -0000 http://git-wip-us.apache.org/repos/asf/incubator-predictionio-site/blob/64c98d37/machinelearning/modelingworkflow/index.html ---------------------------------------------------------------------- diff --git a/machinelearning/modelingworkflow/index.html b/machinelearning/modelingworkflow/index.html new file mode 100644 index 0000000..ad63e05 --- /dev/null +++ b/machinelearning/modelingworkflow/index.html @@ -0,0 +1,6 @@ +Modeling Workflow and DASE

In addition to the DASE components, we also introduce the Data Model and Training Model abstractions. The Data Model abstraction refers to the set of Scala classes that implement modeling choices relating to feature extraction, preparation, and/or selection. For this illustration, this only includes the vectorization of text and t.f.-i.d.f. processing, which is entirely implemented in the PreparedData class. The Training Model abstraction refers to any set of classes that individually take in a set of feature observations and output a predictive model. This predictive model is leveraged by the Algorithm component to produce prediction results for queries in real time. In the engine template, this abstraction is implemented in the NBModel class. Please note that these are conceptual abstractions designed to make engine development easier by decoupling class functionality. Keeping these abstractions in mind will help you debug your code and make it easier to incorporate different modeling ideas into your engine.

The figure below shows a graphical representation of the engine architecture just described, as well as its interactions with your web/app and a provided Event Server:

Engine Overview

Training The Model

This section will guide you through the two Training Model implementations that come with this engine template. Recall that the Training Model abstraction refers to an arbitrary set of Scala classes that output a predictive model (i.e. implement some method that can be used for prediction). The general problem this engine template tackles is text classification, so our Training Model abstraction domain is restricted to implementations that produce classifiers. In particular, the classification model implemented in this engine template is based on Multinomial Naive Bayes using t.f.-i.d.f. vectorized text.

\ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-predictionio-site/blob/64c98d37/machinelearning/modelingworkflow/index.html.gz ---------------------------------------------------------------------- diff --git a/machinelearning/modelingworkflow/index.html.gz b/machinelearning/modelingworkflow/index.html.gz new file mode 100644 index 0000000..190f1fb Binary files /dev/null and b/machinelearning/modelingworkflow/index.html.gz differ http://git-wip-us.apache.org/repos/asf/incubator-predictionio-site/blob/64c98d37/production/deploy-cloudformation/index.html ---------------------------------------------------------------------- diff --git a/production/deploy-cloudformation/index.html b/production/deploy-cloudformation/index.html new file mode 100644 index 0000000..98741f7 --- /dev/null +++ b/production/deploy-cloudformation/index.html @@ -0,0 +1,6 @@ +Deploying with AWS CloudFormation

This document has been moved to here.

\ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-predictionio-site/blob/64c98d37/production/deploy-cloudformation/index.html.gz ---------------------------------------------------------------------- diff --git a/production/deploy-cloudformation/index.html.gz b/production/deploy-cloudformation/index.html.gz new file mode 100644 index 0000000..56d27b9 Binary files /dev/null and b/production/deploy-cloudformation/index.html.gz differ http://git-wip-us.apache.org/repos/asf/incubator-predictionio-site/blob/64c98d37/resources/faq/index.html ---------------------------------------------------------------------- diff --git a/resources/faq/index.html b/resources/faq/index.html new file mode 100644 index 0000000..c507709 --- /dev/null +++ b/resources/faq/index.html @@ -0,0 +1,123 @@ +Frequently Asked Questions

If you have questions that are not resolved below, you can subscribe and post to the user mailing list. You can follow the instructions here.

Using PredictionIO

Q: How do I check to see if various dependencies, such as ElasticSearch and HBase, are running?

You can run $ pio status from the terminal and it will return the status of various components that PredictionIO depends on.

  • You should see the following message if everything is OK:
$ pio status
+PredictionIO
+  Installed at: /home/vagrant/PredictionIO
+  Version: 0.8.6
+
+Apache Spark
+  Installed at: /home/vagrant/PredictionIO/vendors/spark-1.2.0
+  Version: 1.2.0 (meets minimum requirement of 1.2.0)
+
+Storage Backend Connections
+  Verifying Meta Data Backend
+  Verifying Model Data Backend
+  Verifying Event Data Backend
+  Test write Event Store (App Id 0)
+2015-02-03 18:52:38,904 INFO  hbase.HBLEvents - The table predictionio_eventdata:events_0 doesn't exist yet. Creating now...
+2015-02-03 18:52:39,868 INFO  hbase.HBLEvents - Removing table predictionio_eventdata:events_0...
+
+(sleeping 5 seconds for all messages to show up...)
+Your system is all ready to go.
+
  • If you see the following error message, it usually means ElasticSearch is not running properly:
  ...
+Storage Backend Connections
+  Verifying Meta Data Backend
+  ...
+Caused by: org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes are available: []
+    at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:298)
+  ...
+
+Unable to connect to all storage backend(s) successfully. Please refer to error message(s) above. Aborting.
+

You can check whether an ElasticSearch process is running with 'jps'.

Please see How to start ElasticSearch below.

  • If you see the following error message, it usually means HBase is not running properly:
Storage Backend Connections
+  Verifying Meta Data Backend
+  Verifying Model Data Backend
+  Verifying Event Data Backend
+2015-02-03 18:40:04,810 ERROR zookeeper.RecoverableZooKeeper - ZooKeeper exists failed after 1 attempts
+2015-02-03 18:40:04,812 ERROR zookeeper.ZooKeeperWatcher - hconnection-0x1e4075ce, quorum=localhost:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
+org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
+...
+2015-02-03 18:40:07,021 ERROR hbase.StorageClient - Failed to connect to HBase. Plase check if HBase is running properly.
+2015-02-03 18:40:07,026 ERROR storage.Storage$ - Error initializing storage client for source HBASE
+2015-02-03 18:40:07,027 ERROR storage.Storage$ - Can't connect to ZooKeeper
+java.util.NoSuchElementException: None.get
+...
+
+Unable to connect to all storage backend(s) successfully. Please refer to error message(s) above. Aborting.
+

You can check whether any HBase-related process is running with 'jps'.

Please see How to start HBase below.
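
The jps checks described in this answer can be scripted. The following is a minimal sketch, not part of PredictionIO: the helper name jvm_process_listed is hypothetical, and the process names in the usage note (Elasticsearch, HMaster) are assumptions about what jps prints on a typical installation, so verify them against your own jps output.

```shell
# Sketch: report whether a JVM process with a given name appears in
# jps-style output ("<pid> <name>" per line) read from standard input.
# jvm_process_listed is a hypothetical helper, not a PredictionIO command.
jvm_process_listed() {
    # $1: process name, matched exactly against the second column
    awk -v name="$1" '$2 == name { found = 1 } END { exit (found ? 0 : 1) }'
}
```

For example, `jps | jvm_process_listed HMaster && echo "HBase master is running"` prints the message only when a matching line exists. Matching the whole second column avoids false positives from names that merely contain the search string.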

Q: How to start ElasticSearch?

If you used the install script to install PredictionIO, ElasticSearch is installed at ~/PredictionIO/vendors/elasticsearch-x.y.z/, where x.y.z is the version number (currently 1.4.4). To start it, run:

$ ~/PredictionIO/vendors/elasticsearch-x.y.z/bin/elasticsearch
+

If you didn't use the install script, start ElasticSearch from the directory where it is installed.

It may take some time (15 seconds or so) for ElasticSearch to become ready after you start it (wait a bit before you run pio status again).

Q: How to start HBase?

If you used the install script to install PredictionIO, HBase is installed at ~/PredictionIO/vendors/hbase-x.y.z/, where x.y.z is the version number (currently 0.98.6). To start it, run:

$ ~/PredictionIO/vendors/hbase-x.y.z/bin/start-hbase.sh
+

If you didn't use the install script, start HBase from the directory where it is installed.

It may take some time (15 seconds or so) for HBase to become ready after you start it (wait a bit before you run pio status again).
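
Instead of waiting by hand and re-running pio status, the wait can be automated with a small retry loop. This is a sketch under assumptions: retry_until is a hypothetical helper, and the attempt count and delay in the usage note are illustrative values, not recommendations from the PredictionIO project.

```shell
# Sketch: run a command repeatedly until it succeeds or attempts run out.
# Usage: retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# retry_until is a hypothetical helper, not a PredictionIO command.
retry_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Stop as soon as the command exits with status 0.
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, `retry_until 6 5 pio status` tries up to six times, five seconds apart, and returns a non-zero status if the storage backends never become reachable.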

Problem with Event Server

Q: How do I increase the JVM heap size of the Event Server?

Set the JAVA_OPTS environment variable to supply JVM options, e.g.

$ JAVA_OPTS=-Xmx16g bin/pio eventserver ...
+

Engine Training

Q: How to increase Spark driver program and worker executor memory size?

In general, the PredictionIO bin/pio script wraps Spark's spark-submit script. You can specify many Spark configuration options (e.g. executor memory, cores, master URL) through it by supplying them as pass-through arguments at the end of the bin/pio command.

If the engine training seems stuck, it's possible that the executor doesn't have enough memory.

First, follow the instructions here to start a standalone Spark cluster and get the master URL. If you used the provided quick install script to install PredictionIO, Spark is installed at PredictionIO/vendors/spark-1.2.0/, where you can run the Spark commands in sbin/ as described in the Spark documentation. Then use the following train command to specify the executor memory (the default is only 512 MB) and the driver memory.

For example, the following command sets the Spark master to spark://localhost:7077 (the default URL of a standalone cluster), the driver memory to 16G, and the executor memory to 24G for pio train.

$ pio train -- --master spark://localhost:7077 --driver-memory 16G --executor-memory 24G
+

Q: How to resolve "Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 165:35 was 110539813 bytes, which exceeds max allowed: spark.akka.frameSize (10485760 bytes) - reserved (204800 bytes). Consider increasing spark.akka.frameSize or using broadcast variables for large values."?

A likely reason is that the local algorithm model is larger than the default frame size. You can specify a larger value as a pass-through argument to spark-submit when you run pio train. The following command increases the frameSize to 1024 MB.

$ pio train -- --conf spark.akka.frameSize=1024
+
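
To pick a value smaller than 1024, the minimum frame size can be worked out from the numbers in the error message. The sketch below assumes, as the message suggests (10485760 bytes for the default), that one unit of spark.akka.frameSize equals 1048576 bytes and that the reserved amount is 204800 bytes. The function name min_frame_size_mb is hypothetical.

```shell
# Sketch: smallest spark.akka.frameSize (in MB) that fits a serialized task.
# Usage: min_frame_size_mb TASK_BYTES [RESERVED_BYTES]
# min_frame_size_mb is a hypothetical helper; the 204800-byte default for
# RESERVED_BYTES is the "reserved" figure quoted in the error message.
min_frame_size_mb() {
    task_bytes=$1
    reserved_bytes=${2:-204800}
    mb=1048576
    # Ceiling division: smallest N such that N * 1 MiB - reserved >= task size.
    echo $(( (task_bytes + reserved_bytes + mb - 1) / mb ))
}
```

For the task in the error message, `min_frame_size_mb 110539813` prints 106, so any value from 106 upwards would clear this particular failure. Leaving headroom above the minimum is a reasonable precaution because model sizes tend to grow with the training data.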

Deploy Engine

Q: How to increase heap space memory for "pio deploy"?

If you see the following error during pio deploy, it means there is not enough heap space memory.

...
+[ERROR] [LocalFSModels] Java heap space
+[ERROR] [OneForOneStrategy] None.get
+...
+

To increase the heap space, specify the "-- --driver-memory" parameter in the command. For example, set the driver memory to 8G when deploying the engine:

$ pio deploy -- --driver-memory 8G
+

Building PredictionIO

Q: How to resolve "Error: Could not find or load main class org.apache.predictionio.tools.Console" after ./make_distribution.sh?

$ bin/pio app
+Error: Could not find or load main class org.apache.predictionio.tools.Console
+

When PredictionIO bumps a version, it creates another JAR file with the new version number.

Delete everything but the latest pio-assembly-<VERSION>.jar in the $PIO_HOME/assembly directory. For example:

PredictionIO$ cd assembly/
+PredictionIO/assembly$ ls -al
+total 197776
+drwxr-xr-x  2 yipjustin yipjustin      4096 Nov 12 00:08 .
+drwxr-xr-x 17 yipjustin yipjustin      4096 Nov 12 00:09 ..
+-rw-r--r--  1 yipjustin yipjustin 101184982 Nov  5 06:05 pio-assembly-0.8.1-SNAPSHOT.jar
+-rw-r--r--  1 yipjustin yipjustin 101324859 Nov 12 00:09 pio-assembly-0.8.2.jar
+
+PredictionIO/assembly$ rm pio-assembly-0.8.1-SNAPSHOT.jar
+
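
The clean-up above can be previewed with a small helper before deleting anything. This is a sketch with a hypothetical function name, stale_assembly_jars, and it assumes that the most recently modified JAR is the one to keep, which matches the listing above but may not hold if an older version was rebuilt later.

```shell
# Sketch: list every pio-assembly-*.jar in a directory except the newest one.
# Usage: stale_assembly_jars DIR
# stale_assembly_jars is a hypothetical helper. "Newest" means most recently
# modified, which is an assumption, not a comparison of version numbers.
stale_assembly_jars() {
    # ls -t sorts newest first; tail -n +2 skips that first (newest) entry.
    # Assumes JAR file names contain no whitespace or newlines.
    ( cd "$1" && ls -t pio-assembly-*.jar 2>/dev/null | tail -n +2 )
}
```

Running `stale_assembly_jars "$PIO_HOME/assembly"` only prints the candidates. Review the output and remove the files yourself once you are sure the remaining JAR is the one you want.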

Q: How to resolve ".......error java.lang.AssertionError: assertion failed: java.lang.AutoCloseable" when running ./make_distribution.sh?

PredictionIO only supports Java 7 or later. Please make sure you have the correct Java version with the command:

$ javac -version
+
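
The version check can also be done in a script. The sketch below extracts the major version from a javac version line; java_major is a hypothetical helper, and the sample version strings in the comments are illustrative rather than output captured from a specific machine.

```shell
# Sketch: extract the Java major version from a "javac -version" line.
# Usage: java_major "VERSION_LINE"
# java_major is a hypothetical helper, not part of PredictionIO or the JDK.
java_major() {
    # Keep the last space-separated field, e.g. "1.7.0_79" or "11.0.2".
    version=${1##* }
    case "$version" in
        1.*) version=${version#1.} ;;   # legacy numbering: 1.7.0_79 becomes 7.0_79
    esac
    # Keep only the leading digits.
    echo "${version%%[!0-9]*}"
}
```

A possible use is `[ "$(java_major "$(javac -version 2>&1)")" -ge 7 ] || echo "Java 7 or later is required"`. The `2>&1` matters because older JDKs print the version to standard error.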

Engine Development

Q: What's the difference between P- and L- prefixed classes and functions?

PredictionIO v0.8 is built on top of Spark, a massively scalable programming framework. A Spark algorithm differs from a conventional single-machine algorithm in that it uses the RDD abstraction as its primary data type.

The PredictionIO framework natively supports both RDD-based algorithms and traditional single-machine algorithms. Controllers prefixed by "P" (e.g. PJavaDataSource, PJavaAlgorithm) operate on data that includes the RDD abstraction; controllers prefixed by "L" are traditional single-machine algorithms.

Running HBase

Q: How to resolve 'Exception in thread "main" java.lang.NullPointerException at org.apache.hadoop.net.DNS.reverseDns(DNS.java:92)'?

HBase relies on reverse DNS being set up properly to function. If your network configuration changes (for example, when working on a laptop with public WiFi hotspots), reverse DNS may stop functioning properly. You can install a DNS server on your own computer. Some users have reported that using Google Public DNS also solves the problem.

\ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-predictionio-site/blob/64c98d37/resources/faq/index.html.gz ---------------------------------------------------------------------- diff --git a/resources/faq/index.html.gz b/resources/faq/index.html.gz new file mode 100644 index 0000000..d2dc738 Binary files /dev/null and b/resources/faq/index.html.gz differ http://git-wip-us.apache.org/repos/asf/incubator-predictionio-site/blob/64c98d37/resources/glossary/index.html ---------------------------------------------------------------------- diff --git a/resources/glossary/index.html b/resources/glossary/index.html new file mode 100644 index 0000000..e0d62cf --- /dev/null +++ b/resources/glossary/index.html @@ -0,0 +1,6 @@ +Glossary

Data Preparator - Part of Engine. It preprocesses the data and forwards it to the algorithm for model training.

Data Source - Part of Engine. It reads data from an input source and transforms it into the desired format.

Engine - An Engine represents a type of prediction, e.g. product recommendation. It is comprised of four components: [D] Data Source and Data Preparator, [A] Algorithm, [S] Serving, [E] Evaluation Metrics.

EngineClient - Part of PredictionSDK. It sends queries to a deployed engine instance through the Engine API and retrieves prediction results.

Event API - Please see Event Server.

Event Server - Event Server is designed to collect data into PredictionIO in an event-based style. Once the Event Server is launched, your application can send data to it through its Event API with HTTP requests or with the EventClient of PredictionIO's SDKs.

EventClient - Please see Event Server.

Live Evaluation - Evaluation of prediction results in a production environment. Prediction results are shown to real users. Users do not rate the results explicitly, but the system observes user behaviors such as click-through rate.

Offline Evaluation - The prediction results are compared with pre-compiled offline datasets. Typically, offline evaluations are meant to identify the most promising approaches.

Test Data - Also commonly referred to as Test Set. A set of data used to assess the strength and utility of a predictive relationship.

Training Data - Also commonly referred to as Training Set. A set of data used to discover potentially predictive relationships. In a PredictionIO Engine, training data is processed through the Data layer and passed on to the algorithm.

\ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-predictionio-site/blob/64c98d37/resources/glossary/index.html.gz ---------------------------------------------------------------------- diff --git a/resources/glossary/index.html.gz b/resources/glossary/index.html.gz new file mode 100644 index 0000000..2f4f3c3 Binary files /dev/null and b/resources/glossary/index.html.gz differ