[ https://issues.apache.org/jira/browse/HIVE1397?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel
]
Mayank Lahiri updated HIVE1397:

Attachment: Histogram_quality.png.jpg
Since there is no approximation guarantee with this streaming algorithm, I ran a pretty comprehensive
set of experiments to determine what the quality of the computed histogram is in terms of
MeanSquaredError, and using R's hist() function as a gold standard.
I simulated a variety of conditions on the data and the Map/Reduce setup. To summarize, I
ran multiple repetitions of histogrambuilding using the UDAF and R on the cross product:
{chunks of sorted data to each mapper, semisorted data to each mapper, random data to each
mapper} x
{ few mappers relative to data size, medium number of mappers, very high number of mappers
} x
{ 10  80 histogram bins in steps of 10 }
The image shows the MSE of the UDAF histogram() as a function of R's histogram for the same
number of bins. A value of 1.0 means that Hive and R histograms are on par. Values less than
1.0 mean that Hive histogram is actually *better* than R's histogram, in terms of MSE.
At 30 and 60 bins, the MSEs of both R and Hive are very small, and the discrepancy seems to
be R doing something funky at those points.
In summary, it looks pretty good!
> histogram() UDAF for a numerical column
> 
>
> Key: HIVE1397
> URL: https://issues.apache.org/jira/browse/HIVE1397
> Project: Hadoop Hive
> Issue Type: New Feature
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Mayank Lahiri
> Assignee: Mayank Lahiri
> Fix For: 0.6.0
>
> Attachments: Histogram_quality.png.jpg, HIVE1397.1.patch
>
>
> A histogram() UDAF to generate an approximate histogram of a numerical (byte, short,
double, long, etc.) column. The result is returned as a map of (x,y) histogram pairs, and
can be plotted in Gnuplot using impulses (for example). The algorithm is currently adapted
from "A streaming parallel decision tree algorithm" by BenHaim and TomTov, JMLR 11 (2010),
and uses space proportional to the number of histogram bins specified. It has no approximation
guarantees, but seems to work well when there is a lot of data and a large number (e.g. 50100)
of histogram bins specified.
> A typical call might be:
> SELECT histogram(val, 10) FROM some_table;
> where the result would be a histogram with 10 bins, returned as a Hive map object.

This message is automatically generated by JIRA.

You can reply to this email to add a comment to the issue online.
