From: Vivek Khanna
Reply-To: dev@mahout.apache.org
Subject: RE: User/Items Reco Engine clustering
Date: Thu, 24 Jun 2010 09:14:57 -0400

Another way to look at the problem is to treat each user's purchases/actions as
features describing that user in a vector space. The problem then reduces to
finding users who are similar to each other based on this feature set.

Clustering would be overly complex, in my humble opinion.

I agree with Sean that the Lucene-based construction, as you describe it, Jay,
is item-based and not user-based.

Hope this helps.

> Date: Wed, 23 Jun 2010 08:59:03 +0100
> Subject: Re: User/Items Reco Engine clustering
> From: srowen@gmail.com
> To: dev@mahout.apache.org
>
> To me you're just describing user-based recommendation. You find a
> neighborhood of similar users, then examine their items, and recommend
> from those by taking a weighted average of the neighborhood's
> preferences.
>
> Your Lucene-based construction then sounds like item-based
> recommendation. Find items similar to what the user prefers and
> recommend based on a weighted average, again.
>
> Do I have that right?
>
> And then, do you need a Hadoop-based implementation using SequenceFiles?
> What kind of data size are you looking at?
>
> On Wed, Jun 23, 2010 at 12:49 AM, Jay Sellers wrote:
> > Thanks Vivek,
> > We do not have predefined clusters/groups. We expect the groups to mutate as
> > more history (data) is accumulated. A simple use case is as follows:
> > John has viewed a pair of jeans, a cowboy hat, a red shirt and a pair of
> > boots.
> > Scott has viewed a pair of jeans, a cowboy hat, a red shirt and a pocket
> > watch.
> > Larry has viewed a pair of jeans, a cowboy hat and a red shirt.
> >
> > When we send Larry and his items into our reco engine, we would expect a
> > pair of boots and a pocket watch to be recommended. We'd expect this
> > because we've determined that John and Scott are 'like' Larry and thus are
> > in the same cluster.
> >
> > Again, we fully expect the cluster members to change as user/item data
> > accumulates.
> >
> > On Tue, Jun 22, 2010 at 4:37 PM, Vivek Khanna wrote:
> >
> >>
> >> Hi,
> >>
> >> For your clustering/grouping, what is your expectation? Do you have
> >> pre-defined clusters/groups into which you want to cluster the items,
> >> or do you envision a system where clusters/groups will change and evolve as
> >> the data changes?
> >>
> >> In each case, it seems you are looking for unsupervised approaches. Is that
> >> correct?
> >>
> >> I am new to this email list, so pardon my ignorance, but from the work I
> >> have done in the past with IR and ML (clustering, More Like This,
> >> categorization, topic detection etc.), my advice to you is to identify your
> >> requirements, use cases and page flow interactions as the first step. :)
> >>
> >> Hope this helps!
> >>
> >> Vivek.
> >>
> >> > Date: Tue, 22 Jun 2010 15:50:18 -0700
> >> > Subject: User/Items Reco Engine clustering
> >> > From: jaysellers@gmail.com
> >> > To: dev@mahout.apache.org
> >> >
> >> > I'm looking to enhance a product recommendation engine. It currently works
> >> > with all data as a whole. I want to introduce clustering/grouping. It's
> >> > model-based and the relationship is the common User-Items relationship.
> >> > Originally I was thinking of using a Canopy / k-means cluster, and then
> >> > determining which cluster a user is in and executing Item Similarity against
> >> > only that cluster of items. However, I can't figure out how to build a
> >> > SequenceFile using vectors with the User/Items relationship. I don't know
> >> > which data points to feed the vector. So I scratched that idea and turned
> >> > my attention to Lucene, thinking that this is really a document issue, where
> >> > users are documents and the items are the content. I should be able to ask
> >> > Lucene, give me documents that look like this "productId3 productId9056
> >> > productId234".
> >> >
> >> > I'm looking for any and all feedback from those experienced in the
> >> > recommendation world, specifically with the grouping of users and items.
> >> >
> >> > Thanks,
> >> > -Jay
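
For reference, the user-based approach described in this thread (treat each user's viewed items as a feature set, find the most similar users, then recommend what they viewed and the target user has not) can be sketched in a few lines of plain Python. This is only an illustration, not Mahout or Lucene code: the Jaccard similarity, the `recommend` helper and the neighbourhood size `k` are assumptions chosen for brevity, and it sums neighbour similarities rather than computing a true weighted average of preferences. The data is the John/Scott/Larry example from Jay's message.

```python
from collections import defaultdict

# Viewing history taken from the example in the thread.
views = {
    "John": {"jeans", "cowboy hat", "red shirt", "boots"},
    "Scott": {"jeans", "cowboy hat", "red shirt", "pocket watch"},
    "Larry": {"jeans", "cowboy hat", "red shirt"},
}


def jaccard(a, b):
    """Similarity of two item sets: size of the overlap divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, views, k=2):
    """Rank items the user has not seen by the summed similarity of the
    k most similar users who have seen them (hypothetical helper)."""
    seen = views[user]
    # Neighbourhood: the k users whose item sets overlap most with this user's.
    neighbours = sorted(
        ((jaccard(seen, items), other) for other, items in views.items() if other != user),
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for similarity, other in neighbours:
        if similarity <= 0:
            continue  # ignore users with nothing in common
        for item in views[other] - seen:
            scores[item] += similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


print(recommend("Larry", views))
```

With this data, John and Scott are equally similar to Larry and each contributes one unseen item, so boots and the pocket watch come out as the recommendations Jay expected, without any explicit clustering step.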