Subject: Re: Solr-recommender
From: Pat Ferrel
Date: Wed, 9 Oct 2013 14:07:18 -0700
To: user@mahout.apache.org
Cc: Ted Dunning

1) Using the current user's history in a more-like-this query against the item-item similarity matrix will produce a user-history based recommendation. Simply fetching the item-item similarity row for a particular item will give you item-similarity based recs with no account taken of user history.
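For anyone wanting to try approach (1), here is a minimal SolrJ sketch. It assumes the item-item similarity matrix has already been indexed so that each item document carries a multi-valued "indicators" field holding the IDs of its LLR-similar items; the core name, field names, and item IDs are illustrative only, not the project's actual schema.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class UserHistoryRecsSketch {
        public static void main(String[] args) throws SolrServerException {
            // Core name, field names, and item IDs are illustrative only.
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");

            // The current user's recent item IDs become the query terms against the
            // indexed similarity rows. Querying with a single item ID instead gives
            // plain item-similarity recs that ignore user history.
            SolrQuery q = new SolrQuery("indicators:(item42 OR item97 OR item13)");
            q.setFields("id", "score");
            q.setRows(10);

            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("id") + "\t" + doc.getFieldValue("score"));
            }
        }
    }

Note that at query time Solr still ranks with its own TF-IDF/cosine scoring, as discussed later in the thread; the offline LLR step only decides which indicator terms get indexed.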
One could imagine a user-user similarity setup, but that's not what we did.

2) What you are doing is something else that I was calling a shopping-cart recommender. You are using the item-set in the current cart and finding similar, what, items? A different way to tackle this is to store all other shopping carts, then use the current cart contents as a more-like-this query against past carts. This will give you items-purchased-together by other users. If you have enough carts it might give even better results. In any case the results will be different.

https://github.com/pferrel/solr-recommender

But if you already have the item-item similarity matrix indexed, this project won't add much. If you have purchase events and view-details events IDed by user, you might try out the cross-recommender part. We've been searching for a data set to try this on.

On Oct 9, 2013, at 12:54 PM, Michael Sokolov wrote:

On 10/9/13 3:08 PM, Pat Ferrel wrote:
> Solr uses cosine similarity for its queries. The implementation on github uses Mahout LLR for calculating the item-item similarity matrix, but when you do the more-like-this query at runtime Solr uses cosine. This can be fixed in Solr; not sure how much work.

It's not clear to me whether it's worth "fixing" this or not. It would certainly complicate scoring calculations when mixing with traditional search terms.

> It sounds like you are doing item-item similarities for recommendations, not actually calculating user-history based recs, is that true?

Yes, that's true so far. Our recommender system has the ability to provide recs based on user history, but we have not deployed this in our app yet. My plan was simply to query based on all the items in the user's "basket" - not sure that this would require a different back end? We're not at the moment considering user-user similarity measures.

> You bring up a point that we're finding. I'm not so sure we need or want a recommender query API that is separate from the Solr query API. What we are doing on our demo site is putting the output of the Solr-recommender where Solr can index it. Our web app framework then allows very flexible queries against Solr, using simple user history, producing the typical user-history based recommendations, or mixing/boosting based on metadata or contextual data. If we leave the recommender query API in Solr we get web app framework integration for free.

> Another point is where the data is stored for the running system. If we allow Solr to index from any storage service it supports, then we also get free integration with most any web app framework and storage service. For the demo site we put the data in a DB and have Solr index it from there. We also store the user history and metadata there. This is supported by most web app frameworks out of the box. You could go a different route and use almost any storage system/file system/content format, since Solr supports a wide variety.

> Given a fully flexible Solr standard query and indexing scheme, all you need do is tweak the query or data source a bit and you have an item-set recommender (shopping cart), a contextual recommender (for example, boost recs from a category), or a pure metadata/content based recommender.

> If the query and storage are left to Solr plus the web app framework, then the github version is complete, if not done. Solr still needs LLR in the more-like-this queries.
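To make the "tweak the query" idea concrete - mixing the same user-history terms with a contextual boost such as a category - here is a hedged fragment in the same SolrJ style. The edismax parser and its bq (boost query) parameter are standard Solr; the field names, category value, and boost factor are assumptions.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ContextualRecsSketch {
        // Same user-history terms as in the earlier sketch, plus a boost query so that
        // items from the user's current context (here a category) float up without
        // excluding everything else. Field names and the boost factor are assumptions.
        static QueryResponse recommend(SolrServer solr, String historyTerms, String category)
                throws SolrServerException {
            SolrQuery q = new SolrQuery("indicators:(" + historyTerms + ")");
            q.set("defType", "edismax");                   // parser that honors bq
            q.set("bq", "category:" + category + "^2.0");  // boost, don't filter
            q.setFields("id", "score");
            q.setRows(10);
            return solr.query(q);
        }
    }

Because bq boosts rather than filters, recs outside the category still appear, just ranked lower.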
Term weights to encode strength scores would also be nice, and I agree that both of these could use some work. I would like to take a look at that version - I may have missed some discussion about it; would you post a link please?

> BTW, lest we forget, this does not imply the Solr-recommender is better than Myrrix or the Mahout-only recommenders. There needs to be some careful comparison of results. Michael, did you do offline or A/B tests during your implementation?

I ran some offline tests using our historical data, but I don't have a lot of faith in these beyond the fact that they indicate we didn't make any obvious implementation errors. We haven't attempted A/B testing yet since our site is so new, and we need to get a meaningful baseline going and sort out a lot of other more pressing issues on the site - recommendations are only one piece, albeit an important one.

Actually, there was an interesting idea for an article posted recently about the difficulty of comparing results across systems in this field: http://www.docear.org/2013/09/23/research-paper-recommender-system-evaluation-a-quantitative-literature-survey/ but that's no excuse not to do better. I'll certainly share when I know more :)

-Mike

> On Oct 9, 2013, at 6:13 AM, Michael Sokolov wrote:

> Just to add a note of encouragement for the idea of better integration between Mahout and Solr:

> On safariflow.com, we've recently converted our recommender, which computes similarity scores with Mahout, from storing scores and running queries with Postgres to doing all of that in Solr. It's been a big improvement, both in terms of indexing speed and, more importantly, the flexibility of the queries we can write. I believe that having scoring built in to the query engine is a key feature for recommendations. More and more I am coming to believe that recommendation should just be considered another facet of search: one among many variables the system may take into account when presenting relevant information to the user. In our system, we still clearly separate search from recommendations, and we probably always will to some extent, but I think we will start to blend the queries more, so that there will be essentially a continuum of query options including more or less "user preference" data.

> I think what I'm talking about may be a bit different from what Pat is describing (in implementation terms), since we do LLR calculations off-line in Mahout and then bulk load them into Solr. We took one of Ted's earlier suggestions to heart and simply ignored the actual numeric scores: we index the top N similar items for each item. Later we may incorporate numeric scores in Solr as term weights. If people are looking for things to do :) I think that would be a great software contribution that could spur this effort onward, since it's difficult to accomplish right now given the Solr/Lucene indexing interfaces, but is already supported by the underlying data model and query engine.

> -Mike
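A rough sketch of the kind of bulk load Michael describes: each item document gets the IDs of its top N LLR-similar items, with the numeric scores simply dropped. The similar-item lists are assumed to come from an offline Mahout run; the core name and "indicators" field are placeholders, not Safari's actual schema.

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexIndicatorsSketch {

        // Index one item plus the IDs of its top-N LLR-similar items (scores ignored).
        // The "items" core and the "indicators" field are assumptions for this sketch.
        static void indexItem(HttpSolrServer solr, String itemId, List<String> similarItemIds)
                throws SolrServerException, IOException {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", itemId);
            for (String similarId : similarItemIds) {
                doc.addField("indicators", similarId);  // multi-valued: one token per similar item
            }
            solr.add(doc);
        }

        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");
            indexItem(solr, "item42", Arrays.asList("item97", "item13", "item7"));
            solr.commit();  // make the new documents searchable
        }
    }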
> On 10/2/13 12:19 PM, Pat Ferrel wrote:

>> Excellent. From Ellen's description, the first Music use may be an implicit preference based recommender using synthetic data? I'm quickly discovering how flexible Solr use is in many of these cases.

>> Here's another use you may have thought of:

>> Shopping cart recommenders, as goes the intuition, are best modeled as recommending from similar item-sets. If you store all shopping carts as your training data (play lists, watch lists, etc.), then as a user adds things to their cart you query for the most similar past carts. Combine the results intelligently and you'll have an item-set recommender. Solr is built to do this item-set similarity. We tried to do this for an ecom site with pure Mahout, but the similarity calc in real time stymied us. We knew we'd need Solr but couldn't devote the resources to spin it up.

>> On the con side, Solr has a lot of stuff you have to work around. It also does not have the ideal similarity measure for many uses (cosine is OK, but LLR would probably be better). You don't want stop word filtering, stemming, white space based tokenizing, or n-grams. You would like explicit weighting. A good thing about Solr is how well it integrates with virtually any doc store, independent of the indexing and query. A bit of an oval peg for a round hole.

>> It looks like the similarity code is replaceable if not pluggable. Much of the rest could be trimmed away by config or adherence to conventions, I suspect. In the demo site I'm working on, I've had to adopt some slightly hacky conventions that I'll describe some day.

>> On Oct 1, 2013, at 10:38 PM, Ted Dunning wrote:

>> Pat,

>> Ellen and some folks in Britain have been working with some data I produced from synthetic music fans.

>> On Tue, Oct 1, 2013 at 2:22 PM, Pat Ferrel wrote:

>> Hi Ellen,

>> On Oct 1, 2013, at 12:38 PM, Ted Dunning wrote:

>> As requested,

>> Pat, meet Ellen.

>> Ellen, meet Pat.

>> On Tue, Oct 1, 2013 at 8:46 AM, Pat Ferrel wrote:

>> Tunneling (rat-holing?) into the cross-recommender and Solr+Mahout version.

>> Things to note:
>> 1) The pure Mahout XRecommenderJob needs a cross-LLR or a cross-similarity job. Currently there is only cooccurrence for sparsification, which is far from optimal. This might take the form of a cross RSJ with two DRMs as input. I can't commit to this but would commit to adding it to the XRecommenderJob.
>> 2) Output to Solr needs a lot of options implemented and tested. The hand-run test should be made into some JUnits. I'm slowly doing this.
>> 3) The Solr query API is unimplemented unless someone else is working on that. I'm building one in a demo site, but it looks to me like a static recommender API is not going to be all that useful, and maybe a document describing how to do it with the Solr query interface would be best, especially for a first step. The reasoning here is that it is so tempting to mix metadata into the recommendation query that a static API is not so obvious. For the demo site the recommender API will be prototyped in a bunch of ways using models and controllers in Rails. If I'm the one to do a Java Solr-recommender query API, it will be after experimenting a bit.

>> Can someone introduce me to Ellen and Tim?
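For the item-set (shopping cart) use described earlier in this thread - store past carts, query them with the current cart's contents, then combine the top carts' items - a minimal sketch under the assumption that each past cart is indexed as a document with a multi-valued "items" field; all names here are illustrative.

    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.common.SolrDocumentList;

    public class CartRecsSketch {
        // Past carts are indexed as documents with a multi-valued "items" field; the
        // current cart's contents become the query, so the closest past carts come
        // back first. Core and field names are assumptions for this sketch.
        static SolrDocumentList similarCarts(SolrServer solr, List<String> currentCart)
                throws SolrServerException {
            StringBuilder terms = new StringBuilder();
            for (String itemId : currentCart) {
                if (terms.length() > 0) terms.append(" OR ");
                terms.append(itemId);
            }
            SolrQuery q = new SolrQuery("items:(" + terms + ")");
            q.setFields("id", "items", "score");
            q.setRows(20);  // fetch a handful of carts; combine their items downstream
            return solr.query(q).getResults();
        }
    }

Combining the returned carts' items (and removing whatever is already in the current cart) is left to the application.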
>> On Sep 28, 2013, at 10:59 AM, Ted Dunning wrote:

>> The one large-ish feature that I think would find general use would be a high performance classifier trainer.

>> For the cleanup sort of thing, it would be good to fully integrate the streaming k-means into the normal clustering commands while revamping the command line API.

>> Dmitriy's recent Scala work would help quite a bit before 1.0. Not sure it can make 0.9.

>> For recommendations, I think that the demo system that Pat started, with the elaborations by Ellen and Tim, would be very good to have.

>> I would be happy to collaborate with somebody on these but am not at all likely to have time to actually do them end to end.

>> Sent from my iPhone

>> On Sep 28, 2013, at 12:40, Grant Ingersoll wrote:

>>> Moving closer to 1.0, removing cruft, etc. Do we have any more major features planned for 1.0? I think we said during 0.8 that we would try to follow pretty quickly w/ another release.

>>> -Grant

>>> On Sep 28, 2013, at 12:33 PM, Ted Dunning wrote:

>>>> Sounds right in principle but perhaps a bit soon.

>>>> What would define the release?

>>>> Sent from my iPhone

>>>> On Sep 27, 2013, at 7:48, Grant Ingersoll wrote:

>>>>> Anyone interested in thinking about 0.9 in the early Nov. time frame?

>>>>> -Grant

>>> --------------------------------------------
>>> Grant Ingersoll | @gsingers
>>> http://www.lucidworks.com