From: Jake Mannix <jake.mannix@gmail.com>
To: Sean Owen
Cc: mahout-dev@lucene.apache.org
Date: Fri, 11 Dec 2009 16:28:28 -0800
Subject: Re: Welcome Jake Mannix

On Fri, Dec 11, 2009 at 3:01 PM, Sean Owen wrote:

> On Fri, Dec 11, 2009 at 10:23 PM, Jake Mannix wrote:
> > Where are these hooks you're describing here? The kind of general
> > framework I would imagine would be nice to have is something like
> > this: users and items themselves live as (semi-structured) documents
> > (e.g. like a Lucene Document, or more generally a
> > Map<String, Map<String, Float>>, where the first key is the "field
> > name", and the values are bag-of-words term-vectors or phrase
> > vectors).
>
> In particular I'm referring to the ItemSimilarity interface. You stick
> that into an item-based recommender (which is really what Ted has been
> describing). So to do content-based recommendation, you just implement
> the notion of similarity based on content and send it in this way.

Ok, this kind of hook is good, but it leaves all of the work to the
user - it would be nice to extend it along the lines I described,
whereby developers can define how to pull out various features of their
items (or users), and then give them a set of Similarities between
those features, as well as interesting combining functions among those.
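For concreteness, here's a minimal sketch of what that content-based
hook could look like against the Taste interfaces (assuming the current
signature where item IDs are longs; the TermVectorStore is a
hypothetical stand-in for wherever per-item term vectors actually
live):

import java.util.Collection;
import java.util.Map;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

/** Cosine similarity over per-item bag-of-words term vectors. */
public class ContentItemSimilarity implements ItemSimilarity {

  /** Hypothetical lookup from item ID to term vector (term -> weight). */
  public interface TermVectorStore {
    Map<String, Float> termVector(long itemID) throws TasteException;
  }

  private final TermVectorStore store;

  public ContentItemSimilarity(TermVectorStore store) {
    this.store = store;
  }

  @Override
  public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
    Map<String, Float> v1 = store.termVector(itemID1);
    Map<String, Float> v2 = store.termVector(itemID2);
    double dot = 0.0;
    double norm1 = 0.0;
    double norm2 = 0.0;
    for (Map.Entry<String, Float> e : v1.entrySet()) {
      Float w2 = v2.get(e.getKey());
      if (w2 != null) {
        dot += e.getValue() * w2;
      }
      norm1 += e.getValue() * e.getValue();
    }
    for (float w : v2.values()) {
      norm2 += w * w;
    }
    return (norm1 == 0.0 || norm2 == 0.0) ? 0.0 : dot / Math.sqrt(norm1 * norm2);
  }

  @Override
  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    // term vectors are assumed static here; nothing to refresh
  }
}

Something like this could drop straight into a
GenericItemBasedRecommender alongside the existing CF similarities.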
> Same with UserSimilarity and user-based recommenders.
>
> I imagine this problem can be reduced to a search problem. Maybe vice
> versa. I suppose my take on it -- and the reality of it -- is that
> what's there is highly specialized for CF. I think it's a good thing,
> since the API will be more natural and I imagine it'll be a lot
> faster. On my laptop I can do recommendations in about 10ms over 10M
> ratings.

Yeah, this is viewing it as a search problem, and similarly, you can do
search over 10-50M documents with Lucene, often with even lower
latency, so there's no reason the two could not be tied together nicely
to provide a blend of content-based and usage-based
recommendations/searches.

> > Now the set of users by themselves, instead of just being labels on
> > the rows of the preference matrix, is a users-by-terms matrix, and
> > the items, instead of being just labels on the columns of the
> > preference matrix, are also an items-by-terms matrix.
>
> Yes, this is a fundamentally offline approach, right? What exists now
> is entirely online. A change in data is reflected immediately. That's
> interesting and simple and powerful, but doesn't really scale -- my
> rule of thumb is that past 100M data points the non-distributed code
> isn't going to work. Below that size -- and that actually describe

Well, computing the user-item content-based similarity matrix *can* be
done offline, and once you have it, it can be used to produce
recommendations online. But another way to do it (and the way we do it
at LinkedIn) is to keep the items in Voldemort, store them "in
transpose" in a Lucene index, and then compute similar items in real
time as a Lucene query. Doing item-based recommendations this way is
just grabbing the sparse set of items a user prefers, OR'ing these
together (with boosts which encode the preferences), and firing off a
live search request (see the sketch below).

> It'll be a challenge to integrate content-based approaches to a larger
> degree than they already are: what can you really do but offer a hook
> to plug in some notion of similarity?

There are a ton of pluggable pieces: there's the hook for
field-by-field similarity (and not just the hook, but a bunch of common
implementations), sure, but then there's also a "feature processing /
extracting" phase, which will be very domain specific, and then the
scoring hook, where pairwise similarities among fields can be combined
nontrivially (via logistic regression, via some nonlinear kernel
function, etc.), as well as a separate system for people to actually
*train* those scorers - that in itself is a huge component.
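To make the live-query idea a couple of paragraphs up concrete, here's
a rough sketch: it assumes a Lucene 2.9-style index whose item
documents carry a "content" field, and reuses the hypothetical
TermVectorStore from the earlier sketch -- none of this is committed
code:

import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;

/** Item-based recs as a live Lucene query: OR together the term
    vectors of the items the user prefers, boosted by preference. */
public class LiveLuceneRecommender {

  public TopDocs recommend(long userID, DataModel model,
                           ContentItemSimilarity.TermVectorStore store,
                           IndexSearcher searcher, int howMany) throws Exception {
    BooleanQuery query = new BooleanQuery();
    for (Preference pref : model.getPreferencesFromUser(userID)) {
      Map<String, Float> terms = store.termVector(pref.getItemID());
      for (Map.Entry<String, Float> term : terms.entrySet()) {
        TermQuery tq = new TermQuery(new Term("content", term.getKey()));
        // boost = preference strength x term weight
        tq.setBoost(pref.getValue() * term.getValue());
        query.add(tq, BooleanClause.Occur.SHOULD);
      }
    }
    // big user profiles may need BooleanQuery.setMaxClauseCount() raised;
    // you'd also want to filter out items the user already has
    return searcher.search(query, howMany);
  }
}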
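And for that last scoring hook, a toy sketch of the combining function
itself - the logistic form, the field names, and the weights are all
placeholder assumptions; the weights would come out of an offline
training run, which is exactly the big missing component:

import java.util.Map;

/** Blend per-field similarities into a single item-item score with a
    logistic function over learned weights (placeholder sketch). */
public class LogisticFieldCombiner {

  private final Map<String, Double> fieldWeights; // learned offline, e.g. by logistic regression
  private final double bias;

  public LogisticFieldCombiner(Map<String, Double> fieldWeights, double bias) {
    this.fieldWeights = fieldWeights;
    this.bias = bias;
  }

  /** fieldSims maps a field name ("title", "tags", ...) to that
      field's similarity for the item pair, from whatever per-field
      Similarity the developer plugged in. */
  public double combine(Map<String, Double> fieldSims) {
    double z = bias;
    for (Map.Entry<String, Double> e : fieldSims.entrySet()) {
      Double w = fieldWeights.get(e.getKey());
      if (w != null) {
        z += w * e.getValue();
      }
    }
    // squash to (0,1); rescale with 2*s - 1 if a [-1,1] Taste similarity is the target
    return 1.0 / (1.0 + Math.exp(-z));
  }
}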
> > Calculating the text-based similarity of *unstructured* documents is
> > one thing, and resolves just to figuring out whether you're doing
> > BM25, Lucene scoring, pure cosine - just a Similarity decision.
>
> Exactly, and this is already implemented in some form as
> PearsonCorrelationSimilarity, for example. So the same bits of ideas
> are in the existing non-distributed code, it just looks different.

Again - the combination of "field" similarities into a whole Item
similarity is a piece which isn't as simple as Pearson / Cosine /
Tanimoto. It's a choice of parametrized function which may need to be
trained, and this part is a new idea (to our recommenders).

> Basically you are clearly interested in
> org.apache.mahout.cf.taste.hadoop, and probably don't need to care
> about the rest unless you wish to. That's good because the new bits
> are the bits that aren't written and that I don't know a lot about.
>
> For example look at .item: this implements Ted's ideas. It's not quite
> complete -- I'm not normalizing the recommendation vector yet, for
> example. So maybe that's a good place to dive in.

Yep, I'll look at those shortly, I'm definitely interested in this.

  -jake