hive-dev mailing list archives

From "Mayank Lahiri (JIRA)" <>
Subject [jira] Updated: (HIVE-1387) Make PERCENTILE work with double data type
Date Wed, 23 Jun 2010 23:33:50 GMT


Mayank Lahiri updated HIVE-1387:

    Attachment: HIVE-1387.2.patch

I've attached HIVE-1387.2.patch, which does the following:

(1) Creates a percentile_approx() UDAF which uses the histogram_numeric() UDAF to estimate
quantiles from a histogram. The syntax matches the existing percentile() UDAF, and extends
it with a third parameter that specifies the number of histogram bins to use (and thus, the
accuracy of quantile estimation):

SELECT percentile_approx(val, 0.5) FROM random;    -- estimates the median
SELECT percentile_approx(val, array(0.5, 0.95, 0.98)) FROM random;    -- estimates 3 quantiles
SELECT percentile_approx(val, 0.5, 1000) FROM random;    -- estimates the median using 1,000
                                                         -- histogram bins instead of the default of 10,000
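
To make the idea in (1) concrete, here is a minimal Python sketch of estimating quantiles from a bounded-size histogram. It is not taken from the patch: the class name `StreamingHistogram`, the closest-pair merge rule, and the linear interpolation between bin centroids are assumptions chosen for illustration, and the actual histogram_numeric() code may differ.

```python
import bisect


class StreamingHistogram:
    """Hypothetical bounded histogram: a sorted list of [centroid, count] bins."""

    def __init__(self, max_bins):
        if max_bins < 1:
            raise ValueError("max_bins must be at least 1")
        self.max_bins = max_bins
        self.bins = []  # kept sorted by centroid

    def add(self, value):
        # Locate the insertion point among the current centroids.
        centroids = [b[0] for b in self.bins]
        i = bisect.bisect_left(centroids, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += 1  # value already has its own bin
            return
        self.bins.insert(i, [value, 1])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Merge the two adjacent bins whose centroids are closest together,
        # replacing them with their count-weighted average.
        j = min(range(len(self.bins) - 1),
                key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
        (x1, c1), (x2, c2) = self.bins[j], self.bins[j + 1]
        self.bins[j:j + 2] = [[(x1 * c1 + x2 * c2) / (c1 + c2), c1 + c2]]

    def quantile(self, q):
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be in [0, 1]")
        if not self.bins:
            raise ValueError("histogram is empty")
        total = sum(c for _, c in self.bins)
        target = q * total  # cumulative count we need to reach
        cum = 0
        for i, (x, c) in enumerate(self.bins):
            if cum + c >= target:
                if i == 0:
                    return x
                # Interpolate between the previous centroid and this one.
                prev_x = self.bins[i - 1][0]
                return prev_x + (target - cum) / c * (x - prev_x)
            cum += c
        return self.bins[-1][0]


if __name__ == "__main__":
    hist = StreamingHistogram(max_bins=100)
    for v in range(10000):
        hist.add(v)
    print(hist.quantile(0.5))  # estimate of the median
```

In this sketch, when the number of bins is at least the number of distinct values, no merging happens and each bin holds one distinct value with its count. Fewer bins trade accuracy for a fixed memory bound.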

(2) I've left the existing percentile() UDAF as it is for the following reasons: when the
number of unique values in a column is relatively small, percentile_approx() will return an
exact result. When the number of unique values in a column is very large (as one might expect
with double), then percentile() will run out of memory and crash, so there's really no need
to modify the existing percentile() to support doubles.
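
The memory argument in (2) can be illustrated with a small Python sketch of an exact percentile. It assumes the exact computation keeps one counter per distinct value and interpolates linearly between the two nearest ranks. The function name `exact_percentile` is hypothetical, and this is not the code of the existing percentile() UDAF.

```python
import math
from collections import Counter


def exact_percentile(values, q):
    """Return (percentile, state_size) for a hypothetical exact computation.

    state_size is the number of counters held, i.e. the number of
    distinct input values, which is what drives memory use here.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    counts = Counter(values)  # one entry per distinct value
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no input values")

    ordered = sorted(counts.items())

    def value_at(rank):
        # Walk the sorted counters until the zero-based rank is covered.
        seen = 0
        for v, c in ordered:
            seen += c
            if rank < seen:
                return v

    pos = q * (total - 1)  # fractional zero-based rank
    lo, hi = math.floor(pos), math.ceil(pos)
    lo_v, hi_v = value_at(lo), value_at(hi)
    return lo_v + (pos - lo) * (hi_v - lo_v), len(counts)


if __name__ == "__main__":
    # Few distinct values: the state stays tiny regardless of row count.
    print(exact_percentile([5] * 1000, 0.5))
    # Mostly distinct doubles: the state grows with the number of rows.
    print(exact_percentile([i / 7 for i in range(1000)], 0.5))
```

Under these assumptions the state is bounded by the number of distinct values, which stays small for low-cardinality columns but approaches the row count for a column of doubles.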

(3) The accuracy of quantile estimation seems to be pretty good. I've attached a graph
(median_approx_quality.png) showing approximation quality for the median using different
histogram sizes for random datasets of 100,000 numbers. The default number of histogram bins
is 10,000, which appears to work quite well.

(4) This patch also refactors the histogram_numeric() class to put all the generic histogram
functionality into a re-usable inner class. 

> Make PERCENTILE work with double data type
> ------------------------------------------
>                 Key: HIVE-1387
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: Improvement
>    Affects Versions: 0.6.0
>            Reporter: Vaibhav Aggarwal
>            Assignee: Mayank Lahiri
>             Fix For: 0.6.0
>         Attachments: HIVE-1387.2.patch, median_approx_quality.png, patch-1387-1.patch
> The PERCENTILE UDAF does not work with double datatype.

