hadoop-hive-dev mailing list archives

From "Mayank Lahiri (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HIVE-1372) New algorithm for variance() UDAF
Date Wed, 02 Jun 2010 21:52:02 GMT

     [ https://issues.apache.org/jira/browse/HIVE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Mayank Lahiri updated HIVE-1372:

    Attachment: HIVE-1372.3.patch

AFAIK, this is a floating-point rounding error. I ran some tests on millions of large random
doubles, and the differences are consistently confined to the last few significant digits.
Curiously, even the unmodified sum() UDAF differs from R's output in the last few digits when
operating on large-ish synthetic data. That leads me to believe that either Hive or Java's
default println is printing a few more digits than it should, or that Java's floating-point
handling is somehow quirky in terms of rounding.
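
To illustrate the kind of last-digit difference described above, here is a minimal Java sketch.
It is not taken from Hive or from the patch; the class and method names are hypothetical. It sums
the same doubles in two different orders and prints both results with println. Because
floating-point addition is not associative, two correct implementations that accumulate in a
different order can legitimately disagree in the trailing digits.

```java
import java.util.Random;

// Hypothetical demo class, not part of Hive.
class SumOrderDemo {
    /** Plain left-to-right summation, the way a simple sum aggregate might accumulate. */
    static double sumForward(double[] xs) {
        double s = 0.0;
        for (int i = 0; i < xs.length; i++) {
            s += xs[i];
        }
        return s;
    }

    /** Same values, accumulated right-to-left. */
    static double sumBackward(double[] xs) {
        double s = 0.0;
        for (int i = xs.length - 1; i >= 0; i--) {
            s += xs[i];
        }
        return s;
    }

    public static void main(String[] args) {
        // Tiny deterministic case: the two orders differ in the last bit.
        double[] small = {0.1, 0.2, 0.3};
        System.out.println(sumForward(small));
        System.out.println(sumBackward(small));

        // Large random doubles (assumed scale of 1e12, fixed seed for repeatability).
        Random rng = new Random(42L);
        double[] big = new double[1_000_000];
        for (int i = 0; i < big.length; i++) {
            big[i] = rng.nextDouble() * 1e12;
        }
        // println prints enough digits to identify the double uniquely,
        // so any rounding difference in the last digits becomes visible.
        System.out.println(sumForward(big));
        System.out.println(sumBackward(big));
    }
}
```

Any difference between the two printed sums of the large array should be tiny relative to their
magnitude, which is consistent with a rounding effect rather than a logic error.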

I've corrected the two .q.out files and attached the patch.

> New algorithm for variance() UDAF
> ---------------------------------
>                 Key: HIVE-1372
>                 URL: https://issues.apache.org/jira/browse/HIVE-1372
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Mayank Lahiri
>            Assignee: Mayank Lahiri
>            Priority: Minor
>             Fix For: 0.6.0
>         Attachments: HIVE-1372.2.patch, HIVE-1372.3.patch, HIVE-1372.patch
> A new algorithm for the UDAF that computes variance. This is pretty much a drop-in replacement
> for the current UDAF, and has two benefits: it is provably numerically stable (reference
> included in comments), and it reduces arithmetic operations by about half.
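
For readers unfamiliar with numerically stable variance computation, the following is a minimal
Java sketch of one well-known approach: a Welford-style running update plus a pairwise merge of
partial results. It is not necessarily the algorithm used in the attached patch, and the class
and method names are hypothetical. The merge step is included on the assumption that a
distributed aggregate needs to combine partial states computed on different subsets of the rows.

```java
// Hypothetical illustration, not the code from HIVE-1372.
final class StableVariance {
    private long count = 0;    // number of values seen
    private double mean = 0.0; // running mean
    private double m2 = 0.0;   // running sum of squared deviations from the mean

    /** Fold one value into the running state (Welford-style update). */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    /** Combine another partial state into this one. */
    void merge(StableVariance other) {
        if (other.count == 0) {
            return;
        }
        if (count == 0) {
            count = other.count;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long n = count + other.count;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) count * other.count / n);
        mean += delta * other.count / n;
        count = n;
    }

    /** Population variance; NaN when no values have been added. */
    double populationVariance() {
        return count == 0 ? Double.NaN : m2 / count;
    }

    /** Sample variance; NaN when fewer than two values have been added. */
    double sampleVariance() {
        return count < 2 ? Double.NaN : m2 / (count - 1);
    }
}
```

The point of tracking deviations from a running mean, rather than a sum and a sum of squares, is
to avoid subtracting two large and nearly equal numbers at the end, which is where the naive
formula tends to lose precision.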

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
