Subject: Re: Solr-recommender
From: Pat Ferrel
Date: Wed, 9 Oct 2013 14:07:18 -0700
To: user@mahout.apache.org
Cc: Ted Dunning

1) Using the current user's history in a more-like-this query against the item-item similarity matrix will produce a user-history based recommendation. Simply fetching the item-item similarity row for a particular item will give you item-similarity based recs with no account taken of user history.
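For anyone wanting to try approach (1), here is a minimal SolrJ sketch. It assumes the item-item similarity matrix has already been indexed so that each item document carries a multi-valued "indicators" field holding the IDs of its LLR-similar items; the core name, field names, and item IDs are illustrative only, not the project's actual schema.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class UserHistoryRecsSketch {
        public static void main(String[] args) throws SolrServerException {
            // Core name, field names, and item IDs are illustrative only.
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");

            // The current user's recent item IDs become the query terms against the
            // indexed similarity rows. Querying with a single item ID instead gives
            // plain item-similarity recs that ignore user history.
            SolrQuery q = new SolrQuery("indicators:(item42 OR item97 OR item13)");
            q.setFields("id", "score");
            q.setRows(10);

            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("id") + "\t" + doc.getFieldValue("score"));
            }
        }
    }

Note that at query time Solr still ranks with its own TF-IDF/cosine scoring, as discussed later in the thread; the offline LLR step only decides which indicator terms get indexed.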
One could imagine a user-user similarity setup, but that's not what we did.

2) What you are doing is something else that I was calling a shopping-cart recommender. You are using the item-set in the current cart and finding similar, what, items? A different way to tackle this is to store all other shopping carts, then use the current cart contents as a more-like-this query against past carts. This will give you items-purchased-together by other users. If you have enough carts it might give even better results. In any case the results will be different.

https://github.com/pferrel/solr-recommender

But if you already have the item-item similarity matrix indexed, this project won't add much. If you have purchase events and view-details events IDed by user, you might try out the cross-recommender part. We've been searching for a data set to try this on.

On Oct 9, 2013, at 12:54 PM, Michael Sokolov wrote:

On 10/9/13 3:08 PM, Pat Ferrel wrote:
> Solr uses cosine similarity for its queries. The implementation on github uses Mahout LLR for calculating the item-item similarity matrix, but when you do the more-like-this query at runtime Solr uses cosine. This can be fixed in Solr; not sure how much work.

It's not clear to me whether it's worth "fixing" this or not. It would certainly complicate scoring calculations when mixing with traditional search terms.

> It sounds like you are doing item-item similarities for recommendations, not actually calculating user-history based recs, is that true?

Yes, that's true so far. Our recommender system has the ability to provide recs based on user history, but we have not deployed this in our app yet. My plan was simply to query based on all the items in the user's "basket" - not sure that this would require a different back end? We're not at the moment considering user-user similarity measures.

> You bring up a point that we're finding. I'm not so sure we need or want a recommender query API that is separate from the Solr query API. What we are doing on our demo site is putting the output of the Solr-recommender where Solr can index it. Our web app framework then allows very flexible queries against Solr, using simple user history, producing the typical user-history based recommendations, or mixing/boosting based on metadata or contextual data. If we leave the recommender query API in Solr we get web app framework integration for free.

> Another point is where the data is stored for the running system. If we allow Solr to index from any storage service it supports, then we also get free integration with most any web app framework and storage service. For the demo site we put the data in a DB and have Solr index it from there. We also store the user history and metadata there. This is supported by most web app frameworks out of the box. You could go a different route and use almost any storage system/file system/content format, since Solr supports a wide variety.

> Given a fully flexible Solr standard query and indexing scheme, all you need do is tweak the query or data source a bit and you have an item-set recommender (shopping cart), a contextual recommender (for example, boost recs from a category), or a pure metadata/content based recommender.

> If the query and storage are left to Solr plus the web app framework, then the github version is complete, if not done. Solr still needs LLR in the more-like-this queries.
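To make the "tweak the query" idea concrete - mixing the same user-history terms with a contextual boost such as a category - here is a hedged fragment in the same SolrJ style. The edismax parser and its bq (boost query) parameter are standard Solr; the field names, category value, and boost factor are assumptions.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ContextualRecsSketch {
        // Same user-history terms as in the earlier sketch, plus a boost query so that
        // items from the user's current context (here a category) float up without
        // excluding everything else. Field names and the boost factor are assumptions.
        static QueryResponse recommend(SolrServer solr, String historyTerms, String category)
                throws SolrServerException {
            SolrQuery q = new SolrQuery("indicators:(" + historyTerms + ")");
            q.set("defType", "edismax");                   // parser that honors bq
            q.set("bq", "category:" + category + "^2.0");  // boost, don't filter
            q.setFields("id", "score");
            q.setRows(10);
            return solr.query(q);
        }
    }

Because bq boosts rather than filters, recs outside the category still appear, just ranked lower.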
Term weights to encode strength scores would also be nice, and I agree that both of these could use some work. I would like to take a look at that version - I may have missed some discussion about it; would you post a link please?

> BTW, lest we forget, this does not imply the Solr-recommender is better than Myrrix or the Mahout-only recommenders. There needs to be some careful comparison of results. Michael, did you do offline or A/B tests during your implementation?

I ran some offline tests using our historical data, but I don't have a lot of faith in these beyond the fact that they indicate we didn't make any obvious implementation errors. We haven't attempted A/B testing yet since our site is so new, and we need to get a meaningful baseline going and sort out a lot of other more pressing issues on the site - recommendations are only one piece, albeit an important one.

Actually, there was an interesting idea for an article posted recently about the difficulty of comparing results across systems in this field: http://www.docear.org/2013/09/23/research-paper-recommender-system-evaluation-a-quantitative-literature-survey/ but that's no excuse not to do better. I'll certainly share when I know more :)

-Mike

> On Oct 9, 2013, at 6:13 AM, Michael Sokolov wrote:

> Just to add a note of encouragement for the idea of better integration between Mahout and Solr:

> On safariflow.com, we've recently converted our recommender, which computes similarity scores with Mahout, from storing scores and running queries with Postgres to doing all of that in Solr. It's been a big improvement, both in terms of indexing speed and, more importantly, the flexibility of the queries we can write. I believe that having scoring built in to the query engine is a key feature for recommendations. More and more I am coming to believe that recommendation should just be considered another facet of search: one among many variables the system may take into account when presenting relevant information to the user. In our system, we still clearly separate search from recommendations, and we probably always will to some extent, but I think we will start to blend the queries more, so that there will be essentially a continuum of query options including more or less "user preference" data.

> I think what I'm talking about may be a bit different from what Pat is describing (in implementation terms), since we do LLR calculations off-line in Mahout and then bulk load them into Solr. We took one of Ted's earlier suggestions to heart and simply ignored the actual numeric scores: we index the top N similar items for each item. Later we may incorporate numeric scores in Solr as term weights. If people are looking for things to do :) I think that would be a great software contribution that could spur this effort onward, since it's difficult to accomplish right now given the Solr/Lucene indexing interfaces, but is already supported by the underlying data model and query engine.

> -Mike
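A rough sketch of the kind of bulk load Michael describes: each item document gets the IDs of its top N LLR-similar items, with the numeric scores simply dropped. The similar-item lists are assumed to come from an offline Mahout run; the core name and "indicators" field are placeholders, not Safari's actual schema.

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexIndicatorsSketch {

        // Index one item plus the IDs of its top-N LLR-similar items (scores ignored).
        // The "items" core and the "indicators" field are assumptions for this sketch.
        static void indexItem(HttpSolrServer solr, String itemId, List<String> similarItemIds)
                throws SolrServerException, IOException {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", itemId);
            for (String similarId : similarItemIds) {
                doc.addField("indicators", similarId);  // multi-valued: one token per similar item
            }
            solr.add(doc);
        }

        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");
            indexItem(solr, "item42", Arrays.asList("item97", "item13", "item7"));
            solr.commit();  // make the new documents searchable
        }
    }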
> On 10/2/13 12:19 PM, Pat Ferrel wrote:

>> Excellent. From Ellen's description, the first Music use may be an implicit preference based recommender using synthetic data? I'm quickly discovering how flexible Solr use is in many of these cases.

>> Here's another use you may have thought of:

>> Shopping cart recommenders, as goes the intuition, are best modeled as recommending from similar item-sets. If you store all shopping carts as your training data (play lists, watch lists, etc.), then as a user adds things to their cart you query for the most similar past carts. Combine the results intelligently and you'll have an item-set recommender. Solr is built to do this item-set similarity. We tried to do this for an ecom site with pure Mahout, but the similarity calc in real time stymied us. We knew we'd need Solr but couldn't devote the resources to spin it up.

>> On the con side, Solr has a lot of stuff you have to work around. It also does not have the ideal similarity measure for many uses (cosine is OK, but LLR would probably be better). You don't want stop word filtering, stemming, white space based tokenizing, or n-grams. You would like explicit weighting. A good thing about Solr is how well it integrates with virtually any doc store, independent of the indexing and query. A bit of an oval peg for a round hole.

>> It looks like the similarity code is replaceable if not pluggable. Much of the rest could be trimmed away by config or adherence to conventions, I suspect. In the demo site I'm working on, I've had to adopt some slightly hacky conventions that I'll describe some day.

>> On Oct 1, 2013, at 10:38 PM, Ted Dunning wrote:

>> Pat,

>> Ellen and some folks in Britain have been working with some data I produced from synthetic music fans.

>> On Tue, Oct 1, 2013 at 2:22 PM, Pat Ferrel wrote:

>> Hi Ellen,

>> On Oct 1, 2013, at 12:38 PM, Ted Dunning wrote:

>> As requested,

>> Pat, meet Ellen.

>> Ellen, meet Pat.

>> On Tue, Oct 1, 2013 at 8:46 AM, Pat Ferrel wrote:

>> Tunneling (rat-holing?) into the cross-recommender and Solr+Mahout version.

>> Things to note:
>> 1) The pure Mahout XRecommenderJob needs a cross-LLR or a cross-similarity job. Currently there is only cooccurrence for sparsification, which is far from optimal. This might take the form of a cross RSJ with two DRMs as input. I can't commit to this but would commit to adding it to the XRecommenderJob.
>> 2) Output to Solr needs a lot of options implemented and tested. The hand-run test should be made into some JUnits. I'm slowly doing this.
>> 3) The Solr query API is unimplemented unless someone else is working on that. I'm building one in a demo site, but it looks to me like a static recommender API is not going to be all that useful, and maybe a document describing how to do it with the Solr query interface would be best, especially for a first step. The reasoning here is that it is so tempting to mix metadata into the recommendation query that a static API is not so obvious. For the demo site the recommender API will be prototyped in a bunch of ways using models and controllers in Rails. If I'm the one to do a Java Solr-recommender query API, it will be after experimenting a bit.

>> Can someone introduce me to Ellen and Tim?
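For the item-set (shopping cart) use described earlier in this thread - store past carts, query them with the current cart's contents, then combine the top carts' items - a minimal sketch under the assumption that each past cart is indexed as a document with a multi-valued "items" field; all names here are illustrative.

    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.common.SolrDocumentList;

    public class CartRecsSketch {
        // Past carts are indexed as documents with a multi-valued "items" field; the
        // current cart's contents become the query, so the closest past carts come
        // back first. Core and field names are assumptions for this sketch.
        static SolrDocumentList similarCarts(SolrServer solr, List<String> currentCart)
                throws SolrServerException {
            StringBuilder terms = new StringBuilder();
            for (String itemId : currentCart) {
                if (terms.length() > 0) terms.append(" OR ");
                terms.append(itemId);
            }
            SolrQuery q = new SolrQuery("items:(" + terms + ")");
            q.setFields("id", "items", "score");
            q.setRows(20);  // fetch a handful of carts; combine their items downstream
            return solr.query(q).getResults();
        }
    }

Combining the returned carts' items (and removing whatever is already in the current cart) is left to the application.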
>> On Sep 28, 2013, at 10:59 AM, Ted Dunning wrote:

>> The one large-ish feature that I think would find general use would be a high performance classifier trainer.

>> For the cleanup sort of thing, it would be good to fully integrate the streaming k-means into the normal clustering commands while revamping the command line API.

>> Dmitriy's recent Scala work would help quite a bit before 1.0. Not sure it can make 0.9.

>> For recommendations, I think that the demo system that Pat started, with the elaborations by Ellen and Tim, would be very good to have.

>> I would be happy to collaborate with somebody on these but am not at all likely to have time to actually do them end to end.

>> Sent from my iPhone

>> On Sep 28, 2013, at 12:40, Grant Ingersoll wrote:

>>> Moving closer to 1.0, removing cruft, etc. Do we have any more major features planned for 1.0? I think we said during 0.8 that we would try to follow pretty quickly w/ another release.

>>> -Grant

>>> On Sep 28, 2013, at 12:33 PM, Ted Dunning wrote:

>>>> Sounds right in principle but perhaps a bit soon.

>>>> What would define the release?

>>>> Sent from my iPhone

>>>> On Sep 27, 2013, at 7:48, Grant Ingersoll wrote:

>>>>> Anyone interested in thinking about 0.9 in the early Nov. time frame?

>>>>> -Grant

>>> --------------------------------------------
>>> Grant Ingersoll | @gsingers
>>> http://www.lucidworks.com