From: Pat Ferrel
Subject: Re: Question about spark-itemsimilarity
Date: Wed, 14 Dec 2016 18:23:25 -0800
To: "user@mahout.apache.org"

Cross-occurrence allows us to ask the question: are two events correlated? To use the ecom example, purchase is the conversion or primary action; a detail page view might be related, but we must test each cross-occurrence to make sure. I know for a fact that with many ecom datasets it is impossible to treat these events as the same thing and get anything but a drop in quality of recommendations (I’ve tested this). People that use the ALS recommender in Spark’s MLlib sometimes tell you to weight the view less than the purchase. But this is nonsense (again, I’ve tested this). What is true is that *some* views lead to purchases and others do not. So treating them all with the same weight is pure garbage.

What CCO does is find the views that seem to lead to purchase. It can also find category-preferences that lead to certain purchases, as well as location-preference (triggered by a purchase when logged in from some location). And so on.
Just about anything you know about users or can phrase as a possible indicator of user taste can be used to get lift in quality of recommendation.

So in your example below, purchase history is the conversion action; likes and downloads are secondary actions looked at as cross-occurrences. Note that we don’t need to have the same IDs for all actions. This is why I mention location above.

See this blog post and slide deck for more description of the algo: http://actionml.com/blog/cco

BTW to illustrate how powerful this idea is, I have a client that sells one item a year on average to a customer. It’s a very big item and has a lifetime of one year. So using ALS you could only train on the purchase, and if you were gathering a year of data there would be precious little training data. Also, when you have a user with no purchase it is impossible to recommend. ALS fails on all users with no purchase history. However, with CCO, all the user journey and any data about the user you can gather along the way can be used to recommend something to purchase. So this client would be able to recommend to only 20% of their returning shoppers with ALS, and those recs would be of low quality, based on only one event far in the past. CCO using all the clickstream (or important parts of it) can do quite well.

This may seem an edge case, but only in degree; every ecom app has data they are throwing away, and CCO addresses this.

On Dec 13, 2016, at 7:04 AM, Niklas Ekvall wrote:

Thanks Pat for that information!

I meant to handle number of clicks or number of downloads, not ratings. But it is not a problem if Spark doesn't handle values; I have other algorithms that can handle that.

However, I am quite curious about the occurrences, cooccurrences, and cross-occurrences concept. Can the following be a way to handle different data types?
- occurrences: purchase history
- cooccurrences: purchase history/likes
- cross-occurrences: purchase history/clicks or downloads

Best, Niklas

2016-12-01 18:47 GMT+01:00 Pat Ferrel:

> No you can’t, the value is ignored. The algorithm looks at occurrences, cooccurrences, and cross-occurrences of several event types, not values attached to events.
>
> If you are trying to use rating info, this has been pretty much discarded as being not very useful. For instance you may like comedy movies but they always get lower ratings than drama (rater bias), so using ratings to recommend items is highly problematic. But if a user watched a movie, that is a good indicator that they liked it, and that is a boolean value. With cross-occurrence you can also use dislike as an indicator of preference, but this is also boolean: a thumbs down.
>
> To see an end-to-end recommender with all the necessary surrounding infrastructure, check the Apache PredictionIO project and the Universal Recommender, which uses the code behind spark-itemsimilarity to serve recommendations. Read about the UR here: http://actionml.com/docs/ur
>
> On Nov 30, 2016, at 6:58 AM, Niklas Ekvall wrote:
>
> I found that you can, so ignore my question!
>
> Best regards, Niklas
>
> 2016-11-30 15:42 GMT+01:00 Niklas Ekvall:
>
>> Hello!
>>
>> I'm using *spark-itemsimilarity* to produce related recommendations and the input data has the form *userID, itemID*. Could I also use the form *userID, itemID, value* (value > 0)? Or does *spark-itemsimilarity* only handle binary values?
>>
>> Best regards, Niklas
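
To make the cross-occurrence test from the top of this thread concrete, here is a minimal sketch in plain Python. It is not Mahout's code. It assumes the "test" is a log-likelihood ratio (LLR) over boolean (userID, itemID) events, which is my understanding of what spark-itemsimilarity applies; the function names (`llr`, `cross_occurrence`) and the toy purchase/view data are made up for illustration.

```python
from collections import defaultdict
from math import log


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts):
    # Unnormalized entropy of a list of counts, as used in the LLR formula.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    # k11: users with both events, k12: primary only,
    # k21: secondary only,         k22: neither.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def users_by_item(events):
    # Events are boolean: a (user, item) pair either occurred or it did not.
    index = defaultdict(set)
    for user, item in events:
        index[item].add(user)
    return index


def cross_occurrence(primary, secondary):
    # Score every (primary item, secondary item) pair that shares a user.
    p_index = users_by_item(primary)
    s_index = users_by_item(secondary)
    n = len({u for u, _ in primary} | {u for u, _ in secondary})
    scores = {}
    for p_item, p_users in p_index.items():
        for s_item, s_users in s_index.items():
            k11 = len(p_users & s_users)
            if k11 == 0:
                continue
            k12 = len(p_users) - k11
            k21 = len(s_users) - k11
            k22 = n - k11 - k12 - k21
            scores[(p_item, s_item)] = llr(k11, k12, k21, k22)
    return scores


# Hypothetical toy data: purchase is the primary action, view the secondary.
purchases = [("u1", "phone"), ("u2", "phone"), ("u3", "phone"), ("u4", "laptop")]
views = [("u1", "phone-case"), ("u2", "phone-case"), ("u3", "phone-case"),
         ("u4", "laptop-bag")]
views += [(u, "banner") for u in ("u1", "u2", "u3", "u4", "u5", "u6")]

scores = cross_occurrence(purchases, views)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this toy data the phone-case view is made only by phone buyers and scores high, while the banner view is made by everyone and scores zero. That is the sense in which *some* views lead to purchases and others do not: nothing is weighted by hand, and a pair is either kept as an indicator or dropped. A real job would also threshold the scores and downsample heavy users, which this sketch leaves out.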