From: Sean Owen
To: mahout-user@lucene.apache.org
Date: Mon, 9 Nov 2009 12:57:21 +0000
Subject: Re: Re: Re: got Error: GC overhead limit exceeded when generateproductsimilariy

Yes, I agree that keeping all pairs is quite expensive, unless your
data set is relatively small (like tens of thousands of items). If
you're not running out of memory, OK, you can get away with it for
now.

But yes, many of the similarities will not contain much information
and don't add much value -- the question is, which ones?

For Pearson correlation-based similarity, it's not just a matter of
keeping the ones with the largest and smallest similarity scores --
nearest 1 or -1. A similarity of 0 could still be very useful
information. I think you would actually want to keep an item-item pair
based on how many users expressed a preference for both items. The
more users, the more important it is to keep that pair.

If you'd like an example of efficiently looking through a large list
of things, and keeping only the "top n" of them, see the TopItems
class. You don't want to generate all pairs at once, then throw some
away -- that would still run you out of memory.

Ted will say, and again I agree, that Pearson is not usually the best
similarity metric, though it is widely mentioned in collaborative
filtering examples and literature. What Ted quotes below is
implemented in the framework as LogLikelihoodSimilarity.
For that, I believe it *is* the pairs with the largest resulting
similarity score that you do want to keep. Or at least it is more
reasonable. Ted, maybe you can check my thinking on that.

Sean

On Mon, Nov 9, 2009 at 7:09 AM, Ted Dunning wrote:
> Close.
>
> See the link below for one approach to finding the most important ones.  I
> believe that Sean has added something like this to Taste/Mahout.
>
> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
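
A minimal sketch of the "top n" idea from the reply above: stream over
candidate item-item pairs one at a time and retain only the n pairs
with the most co-rating users, so that the full set of pairs is never
held in memory. This is not the code of the TopItems class; the names
TopPairsSketch, ItemPair, coOccurrences and topN are hypothetical and
exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration; not part of Mahout/Taste.
final class TopPairsSketch {

    /** An item-item pair plus the number of users who rated both items. */
    static final class ItemPair {
        public final long itemA;
        public final long itemB;
        public final int coOccurrences;

        public ItemPair(long itemA, long itemB, int coOccurrences) {
            this.itemA = itemA;
            this.itemB = itemB;
            this.coOccurrences = coOccurrences;
        }
    }

    private static final Comparator<ItemPair> BY_COUNT =
        Comparator.comparingInt((ItemPair p) -> p.coOccurrences);

    /**
     * Returns at most n pairs with the largest co-occurrence counts,
     * ordered from largest to smallest. Memory use is O(n) regardless
     * of how many candidates are streamed in.
     */
    public static List<ItemPair> topN(Iterable<ItemPair> candidates, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Min-heap: the weakest retained pair is always at the head.
        PriorityQueue<ItemPair> heap = new PriorityQueue<>(n, BY_COUNT);
        for (ItemPair p : candidates) {
            if (heap.size() < n) {
                heap.add(p);
            } else if (p.coOccurrences > heap.peek().coOccurrences) {
                heap.poll();   // evict the current weakest pair
                heap.add(p);
            }
        }
        List<ItemPair> result = new ArrayList<>(heap);
        result.sort(BY_COUNT.reversed());
        return result;
    }
}
```

The candidates argument can be any lazily generated Iterable, which is
what keeps memory bounded. To rank by a similarity score instead (as
suggested for the log-likelihood case), the same structure applies
with a different comparator.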