hadoop-common-user mailing list archives

From Christian Ulrik Søttrup <soett...@nbi.dk>
Subject Re: joining two large files in hadoop
Date Sun, 05 Apr 2009 14:12:27 GMT
Hi Todd,

Thanks, I already know that trick. I probably shouldn't have said "means".
It is actually a combined high-dimensional matrix representing the
"mean" function of a bunch of other matrices.
Then I need to find the mean difference and variance between the 
constituent matrices and the combined matrix. Here I can use your trick.

cheers,
Christian

Todd Lipcon wrote:
> On Sat, Apr 4, 2009 at 2:11 PM, Christian Ulrik Søttrup <soettrup@nbi.dk> wrote:
>
>   
>> Hello all,
>>
>> I need to do some calculations that have to merge two sets of very large
>> data (basically calculate variance).
>> One set contains a set of "means" and the second a set of objects tied to
>> a mean.
>>
>> Normally I would send the set of means using the distributed cache, but
>> the set has become too large to keep in memory and it is going to grow in
>> the future.
>>
>>     
>
> Hi Christian,
>
> Others have done a good job answering your question about doing this as a
> join, but here's one idea that might allow you to skip the join altogether:
>
> If you're simply calculating variance of data sets, you can use a bit of a
> math trick to do it in one pass without precomputing the means:
>
> E = the expectation operator
> mu = mean = E[x]
> Variance = E[ (x - mu)^2 ]
>
> Expand the square:
> = E[x^2 - 2*x*mu + mu^2]
>
> by linearity of expectation:
> = E[x^2] - 2*mu*E[x] + E[mu^2]
>
> mu in this equation is constant, so E[mu^2] = mu^2.
> Also recall that E[x] = mu
> = E[x^2] - 2*E[x]^2 + E[x]^2
> = E[x^2] - E[x]^2
>
> Apologies for the ugly math notation, but hopefully it's clear. The takeaway
> is that you can separately calculate sum(x^2) and sum(x) in your job, and
> calculate variance directly from the results. Here's the general outline for
> the MR job:
>
> Map:
>   collect (1, x, x^2)
> Combine:
>   sum up tuples
> Reduce:
>   input from combine: (N, sum(x), sum(x^2))
>   output: Variance = (1/N)*sum(x^2) - ((1/N)*sum(x))^2
>
> Hope that's helpful for you!
>
> -Todd
>
>   
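
For readers who want to try the outline above outside of Hadoop, here is a
minimal sketch in plain Python. It only simulates the map, combine, and
reduce steps in memory. The function names (map_record, combine,
reduce_variance) and the sample splits are assumptions made up for this
illustration; they are not part of any Hadoop API and were not in the
original message.

```python
from functools import reduce

def map_record(x):
    # Map step: emit the tuple (1, x, x^2) for a single value.
    return (1, x, x * x)

def combine(a, b):
    # Combine step: element-wise sum of two (N, sum(x), sum(x^2)) tuples.
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def reduce_variance(partials):
    # Reduce step: merge the partial tuples, then apply
    # Variance = (1/N)*sum(x^2) - ((1/N)*sum(x))^2 (population variance).
    n, sum_x, sum_x2 = reduce(combine, partials, (0, 0.0, 0.0))
    if n == 0:
        raise ValueError("no input records")
    mean = sum_x / n
    return sum_x2 / n - mean * mean

# Hypothetical input splits, standing in for data spread across mappers.
splits = [[2.0, 4.0, 4.0], [4.0, 5.0, 5.0], [7.0, 9.0]]

# Each split is mapped and combined locally, as a combiner would do.
partials = [reduce(combine, map(map_record, s), (0, 0.0, 0.0)) for s in splits]

variance = reduce_variance(partials)
print(variance)
```

Because combine is associative and commutative, the partial tuples can be
merged in any order, which is what makes it usable as a combiner.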

