Subject: Re: Evaluating Mahout's recommender support
From: Andy Parsons
Date: Wed, 29 Dec 2010 12:49:05 -0500
To: user@mahout.apache.org

Thanks Sean and Sebastian. I've responded to the questions inline:

On Dec 29, 2010, at 6:26 AM, Sean Owen wrote:

> Yeah, that review, IMHO, had issues. It's important to note the
> context: the person was selling their own services. It was trying to
> run some sample code, non-distributed code, in a sort of distributed
> fashion. The result was predictably not so good. That was a long time
> ago.
>
> 2M users and 10M items isn't big even for a non-distributed
> recommender. This doesn't even sound hard for a non-distributed Mahout
> recommender. Sure, let's hear more and we can give some ideas.
>
> On Wed, Dec 29, 2010 at 4:08 AM, Sebastian Schelter wrote:
>> Hi all,
>>
>> once again, I'm moving a Twitter conversation to this mailing list.
>>
>> Let me introduce Andy, who is currently evaluating recommendation
>> components for his NYC-located startup and looking into Mahout for that
>> reason:
>>
>> "We are coding primarily in Scala and looking to build or license a
>> recommendation component.
>> The base requirement is that it be capable of
>> hybrid recommendations on a body of ~2MM users and ~10MM items with rich
>> metadata. The paper I referenced seems to indicate Mahout is not a
>> great fit - can you point me to recent improvements that make the
>> assertions in the paper obsolete? Any guidance is very much appreciated!"
>>
>> The paper which he's quoting is an old review of Mahout's recommender
>> support available at
>> http://www.iletken-project.com/documents/mahout_review_by_iletken.pdf .
>> I think we should give great advice to Andy and simultaneously give the
>> community an update about the criticized facts in that review that are
>> not true anymore.
>>
>> I'll make a first try to address the state of that review:
>>
>> - Mahout currently offers parallel algorithms for Collaborative
>> Filtering, see
>> https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering
>> which can also be used to precompute a model which can then be used for
>> online recommendations.
>>
>> - Mahout has some support for matrix factorization based recommenders (
>> https://hudson.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/impl/recommender/svd/SVDRecommender.html
>> ), a superior algorithm to this (
>> https://issues.apache.org/jira/browse/MAHOUT-525 ) as well as a parallel
>> implementation ( https://issues.apache.org/jira/browse/MAHOUT-542 ) are
>> currently in the making
>>
>> - The memory consumption of Taste has significantly improved, I never
>> tried to load the Netflix dataset, but I'm pretty sure it fits into some
>> hundred megabytes of memory.
>>
>> Furthermore I think we need to know more details about Andy's use case to
>> give him proper answers about Mahout fitting his project:
>>
>> - Do you have explicit ratings from the users or are you working with
>> implicit data?

[ASP] We will have both, in the form of ratings, views/purchases, and "recommend to a friend"

>>
>> - What do you exactly mean by hybrid recommendations? Do you mean a
>> combination of content based and collaborative filtering techniques?

[ASP] Yes, precisely.

>>
>> - How fast do you need the recommendations? Would it be ok to have them
>> precomputed on a daily basis e.g. or do you need them in realtime?

[ASP] Either *could* work, with a preference for realtime.

>>
>> - How often do new users and new items enter your dataset? How sparse is
>> your rating data?

[ASP] New users are added in the hundreds on a daily basis. Rating data will be very sparse in the initial months the application is live, so we are looking at options for priming the system. Given the quantity of items, however, we'll have fairly sparse rating/item coverage in general.

>>
>> --sebastian
>>
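
To make the "hybrid" requirement discussed above a bit more concrete, here is a minimal sketch of one possible reading of it: an item-to-item similarity that blends a collaborative signal (cosine similarity over user preferences) with a content signal (Jaccard overlap of item metadata tags). This is not Mahout code and not anything proposed in the thread. The data, the function names, and the `alpha` blending weight are all hypothetical, and it assumes explicit ratings and implicit signals (views, purchases) have already been mapped onto one numeric scale.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: preference strength}.
# Assumes ratings and implicit signals were already merged onto one scale.
prefs = {
    "u1": {"a": 5.0, "b": 3.0},
    "u2": {"a": 4.0, "b": 2.0, "c": 5.0},
    "u3": {"b": 4.0, "c": 4.0},
}

# Hypothetical item metadata, reduced to tag sets.
tags = {
    "a": {"scala", "book"},
    "b": {"java", "book"},
    "c": {"scala", "video"},
}


def cf_sim(i, j):
    """Cosine similarity between the user-preference vectors of items i and j."""
    dot = sum(p[i] * p[j] for p in prefs.values() if i in p and j in p)
    norm_i = sqrt(sum(p[i] ** 2 for p in prefs.values() if i in p))
    norm_j = sqrt(sum(p[j] ** 2 for p in prefs.values() if j in p))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0


def content_sim(i, j):
    """Jaccard overlap of the tag sets of items i and j."""
    union = tags[i] | tags[j]
    return len(tags[i] & tags[j]) / len(union) if union else 0.0


def hybrid_sim(i, j, alpha=0.5):
    """Weighted blend: alpha=1 is purely collaborative, alpha=0 purely content-based."""
    return alpha * cf_sim(i, j) + (1 - alpha) * content_sim(i, j)


def recommend(user, n=1, alpha=0.5):
    """Rank items the user has not seen by preference-weighted hybrid similarity."""
    seen = prefs[user]
    scores = {
        cand: sum(hybrid_sim(cand, item, alpha) * value for item, value in seen.items())
        for cand in tags
        if cand not in seen
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

The content term is what would help with the sparse early months mentioned above: an item with no preferences yet has a collaborative similarity of zero to everything, but it can still be ranked through its metadata. Calling `recommend("u1")` ranks the items that `u1` has not yet interacted with.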