From: git-site-role@apache.org
To: commits@predictionio.incubator.apache.org
Date: Mon, 02 Oct 2017 22:01:20 -0000
Subject: [40/51] [abbrv] [partial] incubator-predictionio-site git commit: Documentation based on apache/incubator-predictionio#018ea8e34261f0929ad6d4c669fe80d7520bae16

http://git-wip-us.apache.org/repos/asf/incubator-predictionio-site/blob/d622bec7/datacollection/eventmodel/index.html
----------------------------------------------------------------------
diff --git a/datacollection/eventmodel/index.html b/datacollection/eventmodel/index.html
new file mode 100644
index 0000000..521e56b
--- /dev/null
+++ b/datacollection/eventmodel/index.html
@@ -0,0 +1,295 @@
+Events Modeling

This section explains how to model your application data as events.

Entity: the real-world object involved in an event. An entity may perform events or interact with another entity (which becomes the targetEntity of the event).

For example, your application may have users and items that the users can interact with. You can model them as two entity types, user and item; the entityId uniquely identifies an entity within its entityType (e.g. user with ID 1, item with ID 1).

An entity may perform events (e.g. user 1 does something), and an entity may have properties associated with it (e.g. a user may have a gender, age, email, etc.). Hence events involve entities, and there are three types of events:

  1. Generic events performed by an entity.
  2. Special events for recording changes of an entity's properties
  3. Batch events

They are explained in detail below.

1. Generic events performed by an entity

Whenever an entity performs an action, you can describe the event as: entity "verb" targetEntity with "some extra information". The targetEntity and the extra information are both optional. The "verb" is used as the name of the event, and the extra information is recorded as properties of the event.

The following are some simple examples:

  • user-1 signs-up
{
  "event" : "sign-up",
  "entityType" : "user",
  "entityId" : "1"
}

  • user-1 views item-1 (with targetEntity)
{
  "event" : "view",
  "entityType" : "user",
  "entityId" : "1",
  "targetEntityType" : "item",
  "targetEntityId" : "1"
}

  • user-1 rates item-1 with rating of 4 stars (with targetEntity and properties)
{
  "event" : "rate",
  "entityType" : "user",
  "entityId" : "1",
  "targetEntityType" : "item",
  "targetEntityId" : "1",
  "properties" : {
    "rating" : 4
  }
}
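The three payloads above follow one pattern, so a small helper can assemble them. The following is a minimal sketch in Python, not part of any PredictionIO SDK: `make_event` is a hypothetical helper name, and it only builds the JSON body without sending it anywhere.

```python
import json
from typing import Any, Dict, Optional


def make_event(event: str, entity_type: str, entity_id: str,
               target_entity_type: Optional[str] = None,
               target_entity_id: Optional[str] = None,
               properties: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build an event payload (hypothetical helper, not an SDK call)."""
    payload: Dict[str, Any] = {
        "event": event,
        "entityType": entity_type,
        "entityId": entity_id,
    }
    # The target entity is optional, but its type and ID belong together.
    if (target_entity_type is None) != (target_entity_id is None):
        raise ValueError("targetEntityType and targetEntityId must be given together")
    if target_entity_type is not None:
        payload["targetEntityType"] = target_entity_type
        payload["targetEntityId"] = target_entity_id
    # Extra information about the action is recorded as event properties.
    if properties:
        payload["properties"] = properties
    return payload


sign_up = make_event("sign-up", "user", "1")
view = make_event("view", "user", "1", "item", "1")
rate = make_event("rate", "user", "1", "item", "1", {"rating": 4})
print(json.dumps(rate, indent=2))
```

Keeping the optional parts as keyword arguments mirrors the "entity verb targetEntity with extra information" reading: only the verb and the acting entity are required.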

2. Special events for recording changes of an entity's properties

The generic events described above record general actions performed by an entity. However, an entity may also have properties (or attributes) associated with it, and these properties may change over time (for example, a user may have a new address, or an item may have new categories). To record such changes to an entity's properties, the special events $set, $unset and $delete are introduced.

The following special events are reserved for updating entities and their properties:

  • "$set" event: sets properties of an entity (and implicitly creates the entity). To change properties of an entity, simply set the corresponding properties to their new values again. A $set event should be created only when:
    • the entity is first created (or re-created after a $delete event), or
    • the entity's existing or new properties are set to new values (for example, a user updates their email, a user adds a phone number, or an item has updated categories).
  • "$unset" event: unsets properties of an entity, meaning that the specified properties are treated as no longer existing. Note that the properties field cannot be empty for an $unset event.
  • "$delete" event: deletes the entity.

There is no targetEntityId for these special events.
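The rules listed above can be checked on the client before an event is sent. The sketch below is illustrative only: `check_special_event` is a hypothetical helper, it encodes just the two constraints stated in this section (no target entity, and non-empty properties for $unset), and the Event Server may apply further validation of its own.

```python
from typing import Any, Dict, List

SPECIAL_EVENTS = {"$set", "$unset", "$delete"}


def check_special_event(event: Dict[str, Any]) -> List[str]:
    """Return the problems found; an empty list means both checks passed."""
    problems: List[str] = []
    name = event.get("event")
    if name not in SPECIAL_EVENTS:
        return problems  # generic events are outside the scope of this sketch
    # Special events describe one entity only, so no target entity is allowed.
    if "targetEntityType" in event or "targetEntityId" in event:
        problems.append(f"{name} must not have a target entity")
    # $unset needs to name at least one property to remove.
    if name == "$unset" and not event.get("properties"):
        problems.append("$unset requires a non-empty properties field")
    return problems


good = {"event": "$unset", "entityType": "user", "entityId": "2",
        "properties": {"b": None}}
bad = {"event": "$unset", "entityType": "user", "entityId": "2"}
print(check_special_event(good))
print(check_special_event(bad))
```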

For example, setting entity user-1's properties birthday and address:

{
  "event" : "$set",
  "entityType" : "user",
  "entityId" : "1",
  "properties" : {
    "birthday" : "1984-10-11",
    "address" : "1234 Street, San Francisco, CA 94107"
  }
}

Note that the property values of an entity are aggregated from these special events according to their eventTime, so the state of the entity depends on the point in time at which you look at the data. In the engine's DataSource, you can use the PEventStore.aggregateProperties() API to retrieve the state of an entity's properties (based on time).

Although it doesn't hurt to import duplicate special events for an entity (with exactly the same properties) into the Event Server, doing so wastes storage space: the entity simply changes to the same state as before, and the duplicate event provides no new information about it.

To demonstrate these special events, we are going to import a sequence of events and see how it affects the retrieved entity's properties.

The following assumes that you have created an App (named "MyTestApp") for testing and that the Event Server is running.

Event 1

For example, on 2014-09-09T..., a user with ID "2" is newly added to your application, with properties a = 3 and b = 4. To record this, we can create a $set event for the user.

For convenience, assign the access key of your test app to the shell variable ACCESS_KEY and run the following curl command to import the event:

$ ACCESS_KEY="<YOUR_ACCESS_KEY>"

$ curl -i -X POST "http://localhost:7070/events.json?accessKey=$ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
  "event" : "$set",
  "entityType" : "user",
  "entityId" : "2",
  "properties" : {
    "a" : 3,
    "b" : 4
  },
  "eventTime" : "2014-09-09T16:17:42.937-08:00"
}'

You should see something like the following, meaning the event was imported successfully.

HTTP/1.1 201 Created
Server: spray-can/1.3.2
Date: Tue, 02 Jun 2015 23:13:58 GMT
Content-Type: application/json; charset=UTF-8
Content-Length: 57

{"eventId":"PVjOIP6AJ5PgsiGQW6pgswAAAUhc7EwZpCfSj5bS5yg"}

After this eventTime, user-2 is created and has properties of a = 3 and b = 4.
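For readers who prefer a script to curl, the same request can be prepared with the Python standard library. This is a minimal sketch under stated assumptions: `build_event_request` is a hypothetical helper name, the server address is the localhost URL used in this walkthrough, and the actual send is left commented out because it needs a running Event Server.

```python
import json
import urllib.parse
import urllib.request


def build_event_request(access_key: str, event: dict,
                        base_url: str = "http://localhost:7070") -> urllib.request.Request:
    """Prepare (but do not send) a POST to /events.json (hypothetical helper)."""
    # urlencode escapes the access key so it is safe inside the query string.
    query = urllib.parse.urlencode({"accessKey": access_key})
    return urllib.request.Request(
        url=f"{base_url}/events.json?{query}",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


event_1 = {
    "event": "$set",
    "entityType": "user",
    "entityId": "2",
    "properties": {"a": 3, "b": 4},
    "eventTime": "2014-09-09T16:17:42.937-08:00",
}
request = build_event_request("<YOUR_ACCESS_KEY>", event_1)

# Sending requires a running Event Server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(response.status, json.loads(response.read()))
```

Building the query string with `urlencode` avoids the shell-quoting concerns that apply to the curl form of the command.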

Event 2

Then, on 2014-09-10T..., let's say the user updates the properties to b = 5 and c = 6. To record this property change, create another $set event. Run the following command:

$ curl -i -X POST "http://localhost:7070/events.json?accessKey=$ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
  "event" : "$set",
  "entityType" : "user",
  "entityId" : "2",
  "properties" : {
    "b" : 5,
    "c" : 6
  },
  "eventTime" : "2014-09-10T13:12:04.937-08:00"
}'

After this eventTime, user-2 has properties of a = 3, b = 5 and c = 6. Note that property b is updated to the latest value, while property a is left unchanged.

Event 3

Then, let's say on 2014-09-11T..., the user's property b is removed for some reason. To record this, create an $unset event for user-2 with property b:

$ curl -i -X POST "http://localhost:7070/events.json?accessKey=$ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
  "event" : "$unset",
  "entityType" : "user",
  "entityId" : "2",
  "properties" : {
    "b" : null
  },
  "eventTime" : "2014-09-11T14:17:42.456-08:00"
}'

After this eventTime, user-2 has properties of a = 3 and c = 6. Note that property b is removed.

Event 4

Then, on 2014-09-12T..., the user is removed from the application data. To record this, create a $delete event:

$ curl -i -X POST "http://localhost:7070/events.json?accessKey=$ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
  "event" : "$delete",
  "entityType" : "user",
  "entityId" : "2",
  "eventTime" : "2014-09-12T16:13:41.452-08:00"
}'

After this eventTime, user-2 is removed.

Event 5

Then, on 2014-09-13T..., let's say we want to add user-2 back into the application for some reason. To record this, create a $set event for user-2 with empty properties:

$ curl -i -X POST "http://localhost:7070/events.json?accessKey=$ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
  "event" : "$set",
  "entityType" : "user",
  "entityId" : "2",
  "eventTime" : "2014-09-13T16:17:42.143-08:00"
}'

After this eventTime, user-2 is created again with empty properties.

Note that all of the above events are recorded in the Event Store. Let's query the Event Server to check that these events were imported.

Go to the following URL in your browser:

http://localhost:7070/events.json?accessKey=<YOUR_ACCESS_KEY>

or run the following command in a terminal:

$ curl -i -X GET "http://localhost:7070/events.json?accessKey=$ACCESS_KEY"

Note that you should quote the entire URL by using single or double quotes when you run the curl command.

You should see all the events created for user-2.

Now, let's retrieve user-2's properties using the PEventStore API.

First, start pio-shell by running:

$ pio-shell --with-spark

You should see the following output and shell prompt:

15/06/02 16:01:54 INFO SparkILoop: Created spark context..
Spark context available as sc.
15/06/02 16:01:54 INFO SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.

scala>

Run the following code in the PIO shell (replace "MyTestApp" with your app name):

scala> val appName="MyTestApp"
scala> import org.apache.predictionio.data.store.PEventStore
scala> PEventStore.aggregateProperties(appName=appName, entityType="user")(sc).collect()

This command uses PEventStore to aggregate the user properties as pairs of user ID and PropertyMap; collect() returns the data as an array. You should see the following output at the end, which shows user ID 2 with empty properties, because that is the state of user 2 once all imported events are taken into account.

res0: Array[(String, org.apache.predictionio.data.storage.PropertyMap)] =
Array((2,PropertyMap(Map(), 2014-09-09T16:17:42.937-08:00, 2014-09-13T16:17:42.143-08:00)))

Let's say we want to retrieve the state of user 2's properties with only event 1 and event 2 applied. To do that, we can specify untilTime in the API, which aggregates the user properties from events up to the specified time.

Run the following in the pio-shell. The untilTime is set to DateTime(2014, 9, 11, 0, 0), which is after event 2 and before event 3.

scala> import org.joda.time.DateTime
scala> PEventStore.aggregateProperties(appName=appName, entityType="user", untilTime=Some(new DateTime(2014, 9, 11, 0, 0)))(sc).collect()

You should see the following output. The aggregated properties match what we expected, as described earlier (right before event 3): user-2 has properties of a = 3, b = 5 and c = 6.

res2: Array[(String, org.apache.predictionio.data.storage.PropertyMap)] =
Array((2,PropertyMap(Map(b -> JInt(5), a -> JInt(3), c -> JInt(6)), 2014-09-09T16:17:42.937-08:00, 2014-09-10T13:12:04.937-08:00)))

As you have seen in the example above, the state of user-2 differs depending on the available events or the time at which you look at the data. Recording events in a log-like fashion allows us to reconstruct the state of the entity as of any given time.
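The replay logic behind this walkthrough can be modelled in a few lines. The sketch below is a plain-Python illustration, not the PEventStore implementation: `aggregate_properties` is a hypothetical function, it handles a single entity type, it treats the cut-off time as exclusive (an assumption), and the cut-off is given an explicit -08:00 offset so that it can be compared with the event times.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def event_time(event: Dict[str, Any]) -> datetime:
    return datetime.fromisoformat(event["eventTime"])


def aggregate_properties(events: List[Dict[str, Any]],
                         until_time: Optional[datetime] = None) -> Dict[str, Dict[str, Any]]:
    """Replay $set/$unset/$delete in eventTime order; returns entityId -> properties."""
    state: Dict[str, Dict[str, Any]] = {}
    for event in sorted(events, key=event_time):
        if until_time is not None and event_time(event) >= until_time:
            break  # events are sorted, so all remaining ones are too late
        entity_id, name = event["entityId"], event["event"]
        if name == "$set":
            # Creates the entity if needed, then overwrites the given properties.
            state.setdefault(entity_id, {}).update(event.get("properties", {}))
        elif name == "$unset":
            for key in event.get("properties", {}):
                state.get(entity_id, {}).pop(key, None)
        elif name == "$delete":
            state.pop(entity_id, None)
    return state


def user2(name: str, time: str, properties: Optional[dict] = None) -> Dict[str, Any]:
    event = {"event": name, "entityType": "user", "entityId": "2", "eventTime": time}
    if properties is not None:
        event["properties"] = properties
    return event


events = [
    user2("$set", "2014-09-09T16:17:42.937-08:00", {"a": 3, "b": 4}),    # event 1
    user2("$set", "2014-09-10T13:12:04.937-08:00", {"b": 5, "c": 6}),    # event 2
    user2("$unset", "2014-09-11T14:17:42.456-08:00", {"b": None}),       # event 3
    user2("$delete", "2014-09-12T16:13:41.452-08:00"),                   # event 4
    user2("$set", "2014-09-13T16:17:42.143-08:00"),                      # event 5
]
cutoff = datetime.fromisoformat("2014-09-11T00:00:00-08:00")

print(aggregate_properties(events))          # {'2': {}}
print(aggregate_properties(events, cutoff))  # {'2': {'a': 3, 'b': 5, 'c': 6}}
```

Because the state is derived purely from the ordered log, the same list of events yields either result depending only on the cut-off, which is the point the walkthrough makes.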

3. Batch Events to the EventServer

Using a different REST endpoint on the usual EventServer port, as of PredictionIO 0.9.5 you can send batches of up to 50 events at a time. The format of each event is as described above, but the JSON payload is packaged as an array of event objects.

Response:

  • Status:
    • 200 on success, meaning an array of results can be returned in the response, even if some events fail (e.g. because they are ill-formatted). The client needs to check each dictionary to verify that all events were successfully created.
    • 400 otherwise (for example, possibly because the batch exceeded 50 events).
  • Data: an array of dictionaries, each of which contains the following keys:
    • "status": 201 if the event was successfully created; otherwise, 400.
    • "eventId": the ID of the event, if it was successfully created.
    • "message": the error message string, if an error occurred during creation.

The order of the response array corresponds to the order of the request array. However, the events may be imported in any order.

Sample Request:

curl -i -X POST "http://localhost:7070/batch/events.json?accessKey=..." \
-H "Content-Type: application/json" -d '
[
    {
        "event": "$set",
        "entityType": "user",
        "entityId": "uid",
        "properties": {
            ...
        }
    },
    {
        "event": "like",
        "entityType": "user",
        "entityId": "uid",
        "targetEntityType": "item",
        "targetEntityId": "iid",
        "properties": {
            ...
        },
        "eventTime": "2004-12-13T21:39:45.618-07:00"
    },
    ...
]'

Sample Response:

HTTP/1.1 200 Successful
Server: spray-can/1.2.1
Date: Wed, 10 Sep 2014 22:51:33 GMT
Content-Type: application/json; charset=UTF-8
Content-Length: 41

[
    {"eventId":"AAAABAAAAQDP3-jSlTMGVu0waj8"},
    {
        "status": 201,
        "eventId": "AAAABAAAAQDP3-jSlTMGVu0waj8"
    },
    {
        "status": 201,
        "eventId": "AAAABAAAAQDP3-jSlTMGVu0waj9"
    },
    ...
    {
        "status": 400,
        "message": "Required entityType is missing"
    },
    ...
]

Notice that each subrequest receives its own status in the response. The limit of 50 events per batch request is in line with Facebook, Mixpanel, SegmentIO and other event sinks that accept batches.
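A client that has more than 50 events to send needs to split them up and then inspect each per-event result. The following is a minimal sketch of that bookkeeping, with no network calls: `chunk_events` and `failed_events` are hypothetical helper names, and the sample results reuse the status and message values shown in the sample response.

```python
from typing import Any, Dict, Iterator, List, Tuple

MAX_BATCH_SIZE = 50  # per-request limit stated in this section


def chunk_events(events: List[Dict[str, Any]],
                 size: int = MAX_BATCH_SIZE) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most `size` events."""
    for start in range(0, len(events), size):
        yield events[start:start + size]


def failed_events(batch: List[Dict[str, Any]],
                  results: List[Dict[str, Any]]) -> List[Tuple[Dict[str, Any], str]]:
    """Pair each rejected event with its error message.

    Relies on the response array being in the same order as the request array.
    """
    failures = []
    for event, result in zip(batch, results):
        if result.get("status") != 201:
            failures.append((event, result.get("message", "unknown error")))
    return failures


# 120 placeholder events are split into batches of 50, 50 and 20.
events = [{"event": "view", "entityType": "user", "entityId": str(i),
           "targetEntityType": "item", "targetEntityId": "iid"} for i in range(120)]
batches = list(chunk_events(events))
print([len(batch) for batch in batches])  # [50, 50, 20]

# Checking a two-event batch against a response shaped like the sample above.
batch = [events[0], {"event": "view", "entityId": "1"}]  # second one lacks entityType
results = [
    {"status": 201, "eventId": "AAAABAAAAQDP3-jSlTMGVu0waj8"},
    {"status": 400, "message": "Required entityType is missing"},
]
for event, message in failed_events(batch, results):
    print(event, "->", message)
```

A 200 status for the whole request is therefore not enough on its own; only the per-event entries show which events were actually created.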

\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-predictionio-site/blob/d622bec7/datacollection/eventmodel/index.html.gz
----------------------------------------------------------------------
diff --git a/datacollection/eventmodel/index.html.gz b/datacollection/eventmodel/index.html.gz
new file mode 100644
index 0000000..0db8114
Binary files /dev/null and b/datacollection/eventmodel/index.html.gz differ

http://git-wip-us.apache.org/repos/asf/incubator-predictionio-site/blob/d622bec7/datacollection/index.html
----------------------------------------------------------------------
diff --git a/datacollection/index.html b/datacollection/index.html
new file mode 100644
index 0000000..099bfe6
--- /dev/null
+++ b/datacollection/index.html
@@ -0,0 +1,7 @@
+Event Server Overview

Apache PredictionIO (incubating) offers an Event Server that collects data in an event-based style via a RESTful API. By default, Event Server uses Apache HBase as data store.

EventServer Highlight

What data should I collect?

The Event Server can collect and store arbitrary events. At the beginning of your project, it is recommended to collect as much data as you can. Later on, you can exclude data that are not relevant to your predictive model in Data Preparator.

Recommendation Engine

With Collaborative Filtering based Recommendation Engine, a common pattern is

user -- action -- item

where users and items have properties associated with them.

For example, for personalized book recommendation, some events to collect would be

  • User 1 purchased product X
  • User 2 viewed product Y
  • User 1 added product Z to the cart

User properties can be gender, age, location, etc. Item properties can be genre, author, and other attributes that may be related to the user's preference.
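The user -- action -- item pattern maps directly onto event payloads. The sketch below is only an illustration: the event names ("purchase", "view", "add-to-cart"), the `to_event` helper and the property values are made-up placeholders, not names required by the Event Server.

```python
from typing import Any, Dict, List, Tuple

# (user ID, action, item ID) triples for the three example interactions
interactions: List[Tuple[str, str, str]] = [
    ("1", "purchase", "X"),
    ("2", "view", "Y"),
    ("1", "add-to-cart", "Z"),
]


def to_event(user_id: str, action: str, item_id: str) -> Dict[str, Any]:
    """Turn one user -- action -- item triple into an event payload."""
    return {
        "event": action,
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "item",
        "targetEntityId": item_id,
    }


events = [to_event(*triple) for triple in interactions]

# Properties of users and items are recorded separately with $set events.
# The values below are placeholders for illustration only.
item_properties = {
    "event": "$set",
    "entityType": "item",
    "entityId": "X",
    "properties": {"genre": "example-genre", "author": "example-author"},
}

for event in events + [item_properties]:
    print(event)
```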

Data collection varies quite a bit based on your application and your prediction goal. We are happy to assist you with your questions.

\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-predictionio-site/blob/d622bec7/datacollection/index.html.gz
----------------------------------------------------------------------
diff --git a/datacollection/index.html.gz b/datacollection/index.html.gz
new file mode 100644
index 0000000..b07220d
Binary files /dev/null and b/datacollection/index.html.gz differ