From: Jake Mannix <jake.mannix@gmail.com>
To: Sean Owen
Cc: mahout-dev@lucene.apache.org
Date: Fri, 11 Dec 2009 16:28:28 -0800
Subject: Re: Welcome Jake Mannix

On Fri, Dec 11, 2009 at 3:01 PM, Sean Owen wrote:

> On Fri, Dec 11, 2009 at 10:23 PM, Jake Mannix wrote:
> > Where are these hooks you're describing here? The kind of general
> > framework I would imagine would be nice to have is something like
> > this: users and items themselves live as (semi-structured) documents
> > (e.g. like a Lucene Document, or more generally a
> > Map<String, Map<String, Float>>, where the first key is the "field
> > name", and the values are bag-of-words term-vectors or phrase
> > vectors).
>
> In particular I'm referring to the ItemSimilarity interface. You stick
> that into an item-based recommender (which is really what Ted has been
> describing). So to do content-based recommendation, you just implement
> the notion of similarity based on content and send it in this way.

Ok, this kind of hook is good, but it leaves all of the work to the
user - it would be nice to extend it along the lines I described,
whereby developers can define how to pull out various features of their
items (or users), and then give them a set of Similarities between
those features, as well as interesting combining functions among those.
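For concreteness, here's a minimal sketch of what that content-based
hook could look like against the Taste interfaces (assuming the current
signature where item IDs are longs; the TermVectorStore is a
hypothetical stand-in for wherever per-item term vectors actually
live):

import java.util.Collection;
import java.util.Map;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

/** Cosine similarity over per-item bag-of-words term vectors. */
public class ContentItemSimilarity implements ItemSimilarity {

  /** Hypothetical lookup from item ID to term vector (term -> weight). */
  public interface TermVectorStore {
    Map<String, Float> termVector(long itemID) throws TasteException;
  }

  private final TermVectorStore store;

  public ContentItemSimilarity(TermVectorStore store) {
    this.store = store;
  }

  @Override
  public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
    Map<String, Float> v1 = store.termVector(itemID1);
    Map<String, Float> v2 = store.termVector(itemID2);
    double dot = 0.0;
    double norm1 = 0.0;
    double norm2 = 0.0;
    for (Map.Entry<String, Float> e : v1.entrySet()) {
      Float w2 = v2.get(e.getKey());
      if (w2 != null) {
        dot += e.getValue() * w2;
      }
      norm1 += e.getValue() * e.getValue();
    }
    for (float w : v2.values()) {
      norm2 += w * w;
    }
    return (norm1 == 0.0 || norm2 == 0.0) ? 0.0 : dot / Math.sqrt(norm1 * norm2);
  }

  @Override
  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    // term vectors are assumed static here; nothing to refresh
  }
}

Something like this could drop straight into a
GenericItemBasedRecommender alongside the existing CF similarities.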
> Same with UserSimilarity and user-based recommenders.
>
> I imagine this problem can be reduced to a search problem. Maybe vice
> versa. I suppose my take on it -- and the reality of it -- is that
> what's there is highly specialized for CF. I think it's a good thing,
> since the API will be more natural and I imagine it'll be a lot
> faster. On my laptop I can do recommendations in about 10ms over 10M
> ratings.

Yeah, this is viewing it as a search problem, and similarly, you can do
search over 10-50M documents with Lucene, often with even lower
latency, so there's no reason the two could not be tied together nicely
to provide a blend of content-based and usage-based
recommendations/searches.

> > Now the set of users by themselves, instead of just being labels on
> > the rows of the preference matrix, is a users-by-terms matrix, and
> > the items, instead of being just labels on the columns of the
> > preference matrix, are also an items-by-terms matrix.
>
> Yes, this is a fundamentally offline approach, right? What exists now
> is entirely online. A change in data is reflected immediately. That's
> interesting and simple and powerful, but doesn't really scale -- my
> rule of thumb is that past 100M data points the non-distributed code
> isn't going to work. Below that size -- and that actually describe

Well, computing the user-item content-based similarity matrix *can* be
done offline, and once you have it, it can be used to produce
recommendations online. But another way to do it (and the way we do it
at LinkedIn) is to keep the items in Voldemort, store them "in
transpose" in a Lucene index, and then compute similar items in real
time as a Lucene query. Doing item-based recommendations this way is
just grabbing the sparse set of items a user prefers, OR'ing these
together (with boosts which encode the preferences), and firing off a
live search request (see the sketch below).

> It'll be a challenge to integrate content-based approaches to a larger
> degree than they already are: what can you really do but offer a hook
> to plug in some notion of similarity?

There are a ton of pluggable pieces: there's the hook for
field-by-field similarity (and not just the hook, but a bunch of common
implementations), sure, but then there's also a "feature processing /
extracting" phase, which will be very domain specific, and then the
scoring hook, where pairwise similarities among fields can be combined
nontrivially (via logistic regression, via some nonlinear kernel
function, etc.), as well as a separate system for people to actually
*train* those scorers - that in itself is a huge component.
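To make the live-query idea a couple of paragraphs up concrete, here's
a rough sketch: it assumes a Lucene 2.9-style index whose item
documents carry a "content" field, and reuses the hypothetical
TermVectorStore from the earlier sketch -- none of this is committed
code:

import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;

/** Item-based recs as a live Lucene query: OR together the term
    vectors of the items the user prefers, boosted by preference. */
public class LiveLuceneRecommender {

  public TopDocs recommend(long userID, DataModel model,
                           ContentItemSimilarity.TermVectorStore store,
                           IndexSearcher searcher, int howMany) throws Exception {
    BooleanQuery query = new BooleanQuery();
    for (Preference pref : model.getPreferencesFromUser(userID)) {
      Map<String, Float> terms = store.termVector(pref.getItemID());
      for (Map.Entry<String, Float> term : terms.entrySet()) {
        TermQuery tq = new TermQuery(new Term("content", term.getKey()));
        // boost = preference strength x term weight
        tq.setBoost(pref.getValue() * term.getValue());
        query.add(tq, BooleanClause.Occur.SHOULD);
      }
    }
    // big user profiles may need BooleanQuery.setMaxClauseCount() raised;
    // you'd also want to filter out items the user already has
    return searcher.search(query, howMany);
  }
}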
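And for that last scoring hook, a toy sketch of the combining function
itself - the logistic form, the field names, and the weights are all
placeholder assumptions; the weights would come out of an offline
training run, which is exactly the big missing component:

import java.util.Map;

/** Blend per-field similarities into a single item-item score with a
    logistic function over learned weights (placeholder sketch). */
public class LogisticFieldCombiner {

  private final Map<String, Double> fieldWeights; // learned offline, e.g. by logistic regression
  private final double bias;

  public LogisticFieldCombiner(Map<String, Double> fieldWeights, double bias) {
    this.fieldWeights = fieldWeights;
    this.bias = bias;
  }

  /** fieldSims maps a field name ("title", "tags", ...) to that
      field's similarity for the item pair, from whatever per-field
      Similarity the developer plugged in. */
  public double combine(Map<String, Double> fieldSims) {
    double z = bias;
    for (Map.Entry<String, Double> e : fieldSims.entrySet()) {
      Double w = fieldWeights.get(e.getKey());
      if (w != null) {
        z += w * e.getValue();
      }
    }
    // squash to (0,1); rescale with 2*s - 1 if a [-1,1] Taste similarity is the target
    return 1.0 / (1.0 + Math.exp(-z));
  }
}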
> > Calculating the text-based similarity of *unstructured* documents is
> > one thing, and resolves just to figuring out whether you're doing
> > BM25, Lucene scoring, pure cosine - just a Similarity decision.
>
> Exactly, and this is already implemented in some form as
> PearsonCorrelationSimilarity, for example. So the same bits of ideas
> are in the existing non-distributed code, it just looks different.

Again - the combination of "field" similarities into a whole Item
similarity is a piece which isn't as simple as Pearson / Cosine /
Tanimoto. It's a choice of parametrized function which may need to be
trained, and this part is a new idea (to our recommenders).

> Basically you are clearly interested in
> org.apache.mahout.cf.taste.hadoop, and probably don't need to care
> about the rest unless you wish to. That's good because the new bits
> are the bits that aren't written and that I don't know a lot about.
>
> For example look at .item: this implements Ted's ideas. It's not quite
> complete -- I'm not normalizing the recommendation vector yet, for
> example. So maybe that's a good place to dive in.

Yep, I'll look at those shortly, I'm definitely interested in this.

  -jake