hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Adam Kramer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-362) avg() returns null if any item in the list is null; create function that doesn't
Date Tue, 24 Mar 2009 18:18:56 GMT

    [ https://issues.apache.org/jira/browse/HIVE-362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688777#action_12688777

Adam Kramer commented on HIVE-362:

Very sorry! After further testing, the actual problem I was having was  that SUM returns a
double, and I was putting it into an INT container without an explicit cast.

So, maybe this bug is a request to change the wiki to make this clear.

Also...let's actually say the opposite. We really should have AVG or some version of it return
NULL if any NULL values are found. This is the default behavior in R, and for good reason...I
will link to the description of why once I find it.

Basically, the argument is that when a sum, average, or variance depends on n in some sense,
comparisons among these groups cannot be trusted. For example, if there are 5 nulls in a set
of 20 variables, SUM(x) / COUNT(x) returns a different value than AVG(x)...that is very unintuitive.

> avg() returns null if any item in the list is null; create function that doesn't
> --------------------------------------------------------------------------------
>                 Key: HIVE-362
>                 URL: https://issues.apache.org/jira/browse/HIVE-362
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Adam Kramer
> Some of the current aggregates (sum, avg) have a fairly standard behavior: If any item
in the list is NULL, the sum, average, etc., cannot be computed. And so, NULL is returned.
> 1) If this is the case, the query should return much faster--see a null, return NULL,
> 2) It would be nice to have versions or ways to use these functions with NULL data--specifically,
to treat the NULL as zero or to ignore the NULL and return the results for non-NULL data.
> This also would apply to the variance functions referenced in https://issues.apache.org/jira/browse/HIVE-165

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message