pig-dev mailing list archives

From "Tamir Kamara (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1150) VAR() Variance UDF
Date Thu, 17 Dec 2009 13:04:18 GMT

    [ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791909#action_12791909 ]

Tamir Kamara commented on PIG-1150:

This would be very useful to me, so I tested your patch, but I got strange results. I believe
the problem is in the combine method: it treats each tuple as if it contained the original
values, but as I understand it, combine should operate on the intermediate output (sum, count,
sum of squares) and do something like this:
static protected Tuple combine(DataBag values) throws ExecException {
	// Running totals over the intermediate (sum, count, sumOfSquares) tuples.
	double sum = 0;
	long count = 0;
	double sumOfSquares = 0;

	Tuple output = mTupleFactory.newTuple(3);

	for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
		Tuple t = it.next();

		// Each tuple is a partial result, not an original input value.
		sum += (Double) t.get(0);
		count += (Long) t.get(1);
		sumOfSquares += (Double) t.get(2);
	}

	output.set(0, sum);
	output.set(1, count);
	output.set(2, sumOfSquares);

	return output;
}
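
To illustrate why combine has to add up partial results rather than raw values, here is a
minimal standalone sketch with no Pig dependencies. The class VarianceSketch, its Partial
holder, and the initial/variance helpers are hypothetical names for illustration and are not
taken from var.patch. It also assumes the final step computes the population variance as
E[x^2] - (E[x])^2; the patch may do this differently.

```java
import java.util.Arrays;
import java.util.List;

public class VarianceSketch {
    // Hypothetical holder mirroring the 3-field intermediate tuple:
    // (sum, count, sumOfSquares).
    static final class Partial {
        final double sum;
        final long count;
        final double sumOfSquares;

        Partial(double sum, long count, double sumOfSquares) {
            this.sum = sum;
            this.count = count;
            this.sumOfSquares = sumOfSquares;
        }
    }

    // "Initial" step: turn raw values into one partial result.
    static Partial initial(double[] values) {
        double sum = 0;
        double sumOfSquares = 0;
        for (double v : values) {
            sum += v;
            sumOfSquares += v * v;
        }
        return new Partial(sum, values.length, sumOfSquares);
    }

    // "Combine" step: add partial results field by field.
    static Partial combine(List<Partial> partials) {
        double sum = 0;
        long count = 0;
        double sumOfSquares = 0;
        for (Partial p : partials) {
            sum += p.sum;
            count += p.count;
            sumOfSquares += p.sumOfSquares;
        }
        return new Partial(sum, count, sumOfSquares);
    }

    // "Final" step (assumed): population variance, E[x^2] - (E[x])^2.
    static double variance(Partial p) {
        if (p.count == 0) {
            throw new IllegalArgumentException("variance of an empty input is undefined");
        }
        double mean = p.sum / p.count;
        return p.sumOfSquares / p.count - mean * mean;
    }

    public static void main(String[] args) {
        Partial a = initial(new double[] {1, 2, 3});
        Partial b = initial(new double[] {4, 5});
        Partial merged = combine(Arrays.asList(a, b));
        System.out.println(variance(merged)); // prints 2.0
    }
}
```

Splitting 1..5 into {1, 2, 3} and {4, 5} gives the same variance as a single pass over all
five values, which is the property the combiner has to preserve. Note that the sum-of-squares
formula can lose precision when the mean is large relative to the spread of the data.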

> VAR() Variance UDF
> ------------------
>                 Key: PIG-1150
>                 URL: https://issues.apache.org/jira/browse/PIG-1150
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.5.0
>         Environment: UDF, written in Pig 0.5 contrib/
>            Reporter: Russell Jurney
>             Fix For: 0.7.0
>         Attachments: var.patch
> I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in
> a distributed manner, based on the AVG() builtin.  It works by calculating the count, sum
> and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
> Is this a worthwhile contribution?  Taking the square root of this value using the contrib
> SQRT() function gives Standard Deviation, which is missing from Pig.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
