hadoop-pig-dev mailing list archives

From "Tamir Kamara (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1150) VAR() Variance UDF
Date Thu, 17 Dec 2009 13:04:18 GMT

    [ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791909#action_12791909 ]

Tamir Kamara commented on PIG-1150:
-----------------------------------

This would be very useful to me, so I tested your patch, but I got weird results. I believe
the problem is in the combine method: it treats each tuple as if it contained the original
values, but as I understand it, it should operate on the intermediate output and do something
like this:


{code}
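// Merges intermediate results. Each tuple in the bag is expected to hold
// (sum, count, sumOfSquares), not the original input values.
static protected Tuple combine(DataBag values) throws ExecException {
	double sum = 0;
	long count = 0;
	double sumOfSquares = 0;

	Tuple output = mTupleFactory.newTuple(3);

	for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
		Tuple t = it.next();

		// Add the partial results field by field.
		sum += (Double) t.get(0);
		count += (Long) t.get(1);
		sumOfSquares += (Double) t.get(2);
	}

	// Emit the merged result in the same (sum, count, sumOfSquares) layout.
	output.set(0, sum);
	output.set(1, count);
	output.set(2, sumOfSquares);

	return output;
}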
{code}
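
To illustrate the point outside of Pig, here is a minimal standalone sketch in plain Java.
It is not taken from the patch: the class and method names (PartialVarianceSketch, initial,
combine) are hypothetical, and a partial result is modeled as a double[] holding
{sum, count, sumOfSquares} instead of a Pig Tuple. It shows that the combine step only has
to add partial results field by field, never square or count them again.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the UDF's intermediate stages; no Pig classes are used.
class PartialVarianceSketch {

    // First stage: reduce raw values to {sum, count, sumOfSquares}.
    static double[] initial(double[] values) {
        double sum = 0;
        double sumOfSquares = 0;
        for (double v : values) {
            sum += v;
            sumOfSquares += v * v;
        }
        return new double[] {sum, values.length, sumOfSquares};
    }

    // Combine stage: the inputs are already partial results, so add them field by field.
    static double[] combine(List<double[]> partials) {
        double[] out = new double[3];
        for (double[] p : partials) {
            out[0] += p[0]; // sum
            out[1] += p[1]; // count
            out[2] += p[2]; // sum of squares
        }
        return out;
    }

    public static void main(String[] args) {
        double[] a = initial(new double[] {1, 2, 3});
        double[] b = initial(new double[] {4, 5});
        double[] merged = combine(Arrays.asList(a, b));
        System.out.println(Arrays.toString(merged)); // prints [15.0, 5.0, 55.0]
    }
}
```

The merged result matches what a single pass over 1, 2, 3, 4, 5 would produce. A combine
step that squared or counted its inputs again would not give this result.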

> VAR() Variance UDF
> ------------------
>
>                 Key: PIG-1150
>                 URL: https://issues.apache.org/jira/browse/PIG-1150
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.5.0
>         Environment: UDF, written in Pig 0.5 contrib/
>            Reporter: Russell Jurney
>             Fix For: 0.7.0
>
>         Attachments: var.patch
>
>
> I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in
> a distributed manner, based on the AVG() builtin.  It works by calculating the count, sum
> and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
> Is this a worthwhile contribution?  Taking the square root of this value using the contrib
> SQRT() function gives Standard Deviation, which is missing from Pig.
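
For reference, the final step described in the issue can be sketched in plain Java as
follows. This is an assumption-laden illustration, not the code in var.patch: the class and
method names (FinalVarianceSketch, variance, stdDev) are hypothetical, and it assumes the
population variance, computed as sumOfSquares/count minus the squared mean.

```java
// Hypothetical final stage: turns (sum, count, sumOfSquares) into a variance.
class FinalVarianceSketch {

    // Population variance: E[x^2] - (E[x])^2. Assumes count > 0.
    static double variance(double sum, long count, double sumOfSquares) {
        double mean = sum / count;
        return sumOfSquares / count - mean * mean;
    }

    // Standard deviation is the square root of the variance.
    static double stdDev(double sum, long count, double sumOfSquares) {
        return Math.sqrt(variance(sum, count, sumOfSquares));
    }

    public static void main(String[] args) {
        // Values 1, 2, 3, 4, 5: sum = 15, count = 5, sumOfSquares = 55.
        System.out.println(variance(15.0, 5L, 55.0)); // prints 2.0
        System.out.println(stdDev(15.0, 5L, 55.0));
    }
}
```

Note that this formula subtracts two potentially large numbers, so it can lose precision
when the mean is large relative to the spread of the values.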

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

