Date: Fri, 27 Nov 2009 12:01:04 +0000
Subject: Re: Mahout/Taste covariance between two items
From: Sean Owen
To: mahout-user@lucene.apache.org

I'm not so familiar with this formula, but you seem to be missing
something in the denominator; it has to normalize somehow. I think
I said to divide by the standard deviation, but that's not quite it. What
you are really summing are the products of z-scores. See
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
But I think you should just use the formulation given in the code; it
should give the same result. At least I hope these aren't different
definitions of Pearson!
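
To make the z-score point concrete, here is a minimal sketch in plain Java.
It is not the Mahout/Taste code; the class and method names are made up for
illustration, and it assumes two equal-length arrays with at least two
values and non-zero variance. It computes the sample covariance (the
sum/(N-1) form), the correlation as a sum of z-score products, and the
correlation as sumXY / sqrt(sumX2 * sumY2) on centered data. The last two
should agree up to rounding.

```java
public class PearsonSketch {

    /** Arithmetic mean of the values. */
    static double mean(double[] v) {
        double sum = 0.0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    /** Sample standard deviation (divides by n - 1). */
    static double sampleStdDev(double[] v) {
        double m = mean(v);
        double ss = 0.0;
        for (double d : v) {
            ss += (d - m) * (d - m);
        }
        return Math.sqrt(ss / (v.length - 1));
    }

    /** Sample covariance: sum((x_i - meanX) * (y_i - meanY)) / (n - 1). */
    static double sampleCovariance(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double s = 0.0;
        for (int i = 0; i < x.length; i++) {
            s += (x[i] - mx) * (y[i] - my);
        }
        return s / (x.length - 1);
    }

    /** Sample Pearson correlation as the sum of z-score products / (n - 1). */
    static double pearsonFromZScores(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double sx = sampleStdDev(x);
        double sy = sampleStdDev(y);
        double s = 0.0;
        for (int i = 0; i < x.length; i++) {
            s += ((x[i] - mx) / sx) * ((y[i] - my) / sy);
        }
        return s / (x.length - 1);
    }

    /** Pearson correlation as sumXY / sqrt(sumX2 * sumY2) on centered data. */
    static double pearsonFromCenteredSums(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - mx; // centered x
            double dy = y[i] - my; // centered y
            sumXY += dx * dy;
            sumX2 += dx * dx;
            sumY2 += dy * dy;
        }
        return sumXY / Math.sqrt(sumX2 * sumY2);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2, 1, 4, 3, 5};
        System.out.println("covariance:         " + sampleCovariance(x, y));
        System.out.println("pearson (z-scores): " + pearsonFromZScores(x, y));
        System.out.println("pearson (centered): " + pearsonFromCenteredSums(x, y));
    }
}
```

The (N-1) factors cancel between the numerator and the two standard
deviations, which is why the centered-sums form needs no N at all. Dividing
the centered sumXY by (N-1) alone gives the covariance, not the correlation.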
On Fri, Nov 27, 2009 at 10:20 AM, jamborta wrote:
>
> thank you. much clearer now.
>
> so for my purpose this will do:
>
> sumXY/(N-1)
>
> given that the data is 'centered'?
>
>
> On Fri, Nov 27, 2009 at 1:41 AM, jamborta wrote:
>>
>> hi. I tried to figure out how you calculate pearson correlation, but it
>> looks like you use this formula:
>>
>> sumXY / sqrt(sumX2 * sumY2)
>
> Yes, that's right: this is what Pearson reduces to when the means of X
> and Y are both 0. And they are here, because the implementation 'centers'
> the data.
>
>> where sumXY = sumXY - meanY * sumX;
>> sumX2 = sumX2 - meanX * sumX;
>> sumY2 = sumY2 - meanY * sumY;
>
> You see the lines commented out there? Those are the full forms of the
> expressions, which may make more sense. This is centering the data,
> making the mean 0.
>
> This is a simplification based on the observation that, for example,
> sumX * meanY = sumY * meanX = n * meanY * meanX.
>
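
A small sketch of that simplification, again in plain Java with made-up
names rather than the actual Taste source. Expanding
sum((x_i - meanX) * (y_i - meanY)) gives
sumXY - meanY * sumX - meanX * sumY + n * meanX * meanY, and since
meanX * sumY = n * meanX * meanY the last two terms cancel, leaving
sumXY - meanY * sumX. The code below computes the centered sums both ways
so they can be compared.

```java
public class CenteredSums {

    /** Centered sums {sumXY, sumX2, sumY2}, subtracting the means from every value. */
    static double[] explicit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sumXY += dx * dy;
            sumX2 += dx * dx;
            sumY2 += dy * dy;
        }
        return new double[] {sumXY, sumX2, sumY2};
    }

    /** The same centered sums, derived from raw sums in a single pass. */
    static double[] shortcut(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0.0;
        double sumY = 0.0;
        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double meanX = sumX / n;
        double meanY = sumY / n;
        // Uses sumX * meanY = sumY * meanX = n * meanX * meanY.
        return new double[] {
            sumXY - meanY * sumX,
            sumX2 - meanX * sumX,
            sumY2 - meanY * sumY
        };
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2, 1, 4, 3, 5};
        System.out.println(java.util.Arrays.toString(explicit(x, y)));
        System.out.println(java.util.Arrays.toString(shortcut(x, y)));
    }
}
```

The single-pass form is convenient, but it subtracts large, nearly equal
numbers when the values are far from zero, so the explicit form can be more
accurate numerically.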
>>
>> i don't really understand how you got these equations. could you explain
>> it to me? I thought pearson correlation would be like this
>>
>> E[(x_i-meanX)(y_i-meanY)] / (sdX*sdY)
>
> That's right, that's the expression for a population correlation, but
> we can really only compute a sample Pearson correlation coefficient,
> yes:
>
>
>> for my project I would need to get sample correlation coefficient which
>> would be something like this:
>>
>> sum(x_i-meanX)(y_i-meanY)/(N-1)
>
> Yeah, that's fine too; this is another way of expressing the formula,
> though you're missing the two standard deviations in the denominator.
> It'll be clearer if I note that the means of X and Y are both 0.