hive-dev mailing list archives

From "Shreepadma Venugopalan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-1362) column level statistics
Date Thu, 04 Oct 2012 19:15:48 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469614#comment-13469614 ]

Shreepadma Venugopalan commented on HIVE-1362:
----------------------------------------------

@Shrikanth: Thank you for your comments. We can certainly add a new UDAF with the Flajolet-Martin
sketch that returns a serialized numDV estimator. I've already filed a new JIRA (HIVE-3516)
for the incremental stats computation work. I'll add the UDAF as part of that JIRA. 
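
To illustrate why a serialized numDV estimator helps with incremental stats, here is a minimal
sketch of a Flajolet-Martin style estimator (the stochastic-averaging variant) in Python. It
is not the Hive UDAF: the class name FMSketch, the choice of 64 bitmaps, the MD5 hash, and the
byte layout are all assumptions made for the example. The point it shows is that per-partition
sketches can be stored as bytes and later merged with a bitwise OR, without rescanning the data.

```python
import hashlib
import struct

PHI = 0.77351  # correction constant from the Flajolet-Martin analysis


class FMSketch:
    """Hypothetical mergeable distinct-value (numDV) estimator."""

    def __init__(self, num_maps=64, bitmaps=None):
        self.num_maps = num_maps
        self.bitmaps = list(bitmaps) if bitmaps is not None else [0] * num_maps

    def add(self, value):
        digest = hashlib.md5(repr(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h % self.num_maps           # which bitmap this value updates
        rest = h // self.num_maps         # remaining hash bits
        # Index of the lowest set bit, capped so each bitmap fits in 32 bits.
        rho = (rest & -rest).bit_length() - 1 if rest else 31
        self.bitmaps[idx] |= 1 << min(rho, 31)

    def merge(self, other):
        # Merging two sketches is a bitwise OR of their bitmaps.
        merged = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return FMSketch(self.num_maps, merged)

    def estimate(self):
        total = 0
        for bm in self.bitmaps:
            r = 0
            while bm & (1 << r):          # position of the lowest zero bit
                r += 1
            total += r
        return self.num_maps / PHI * 2 ** (total / self.num_maps)

    def serialize(self):
        return struct.pack(f">I{self.num_maps}I", self.num_maps, *self.bitmaps)

    @classmethod
    def deserialize(cls, data):
        (n,) = struct.unpack_from(">I", data)
        return cls(n, struct.unpack_from(f">{n}I", data, 4))
```

With a sketch like this, the estimate for a table is obtained by deserializing and merging the
per-partition sketches. The estimate is approximate and is known to be biased for very small
cardinalities.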

We decided to create a new compute_stats aggregation operator, instead of generating more
expressions in the SQL, for a couple of reasons:

1. We felt it's a lot cleaner to encapsulate the stats for a column within a single UDAF. The
compute_stats UDAF returns a struct with the relevant stats for the data type of the column,
which keeps both the parsing and the SQL we generate simple.

2. Adding a new compute_stats UDAF allows statistical summaries of the underlying data to be
gathered even outside of the column stats framework. One use I can think of is modeling the
statistical properties of the data, which in turn can be used to generate data whose
statistical properties mimic those of the underlying data.
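
As a rough illustration of point 1, the following Python sketch shows a single aggregation
that returns one struct-like result whose fields depend on the column's data type. The
function body and all field names (column_type, count_nulls, num_distinct, and so on) are
hypothetical and are not taken from the patch. For brevity it counts distinct values exactly
rather than with a sketch.

```python
def compute_stats(values):
    """Return a dict of per-column stats; the fields vary with the data type."""
    non_null = [v for v in values if v is not None]
    stats = {"count_nulls": len(values) - len(non_null)}
    if not non_null:
        stats["column_type"] = "unknown"
    elif all(isinstance(v, bool) for v in non_null):
        trues = sum(non_null)
        stats.update(column_type="boolean", count_trues=trues,
                     count_falses=len(non_null) - trues)
    elif all(isinstance(v, (int, float)) for v in non_null):
        stats.update(column_type="numeric", min=min(non_null),
                     max=max(non_null), num_distinct=len(set(non_null)))
    elif all(isinstance(v, str) for v in non_null):
        lengths = [len(v) for v in non_null]
        stats.update(column_type="string", max_length=max(lengths),
                     avg_length=sum(lengths) / len(lengths),
                     num_distinct=len(set(non_null)))
    else:
        raise TypeError("mixed or unsupported column type")
    return stats
```

A caller then reads every stat for the column out of one result, instead of matching up a
separate expression per statistic.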

Even though max, min, and total count exist as UDAFs today, we need them to be part of the histogram
UDAF. Estimating quantiles for an equi-height histogram is a lot more efficient if we know the
range of values the column can take, and we need total_count to generate the histogram
bins. Since we need these stats to generate histograms anyway, I think it's a good idea to
encapsulate all of them within the compute_stats UDAF. Thanks.
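
One way to see why min, max, and total_count matter for an equi-height histogram is the
following Python sketch. It is an assumption about how such a computation could look, not the
approach in the patch: the function name, the resolution parameter, and the intermediate
equi-width pass are all made up for the example. Knowing the range lets the values be counted
into fine fixed-width buckets in one pass, and knowing total_count gives the target number of
rows per bin. It assumes total_count equals the number of values passed in.

```python
def equi_height_boundaries(values, min_val, max_val, total_count,
                           num_bins, resolution=1000):
    """Approximate equi-height bin boundaries from a fine equi-width pass."""
    if max_val == min_val:
        return [min_val] * (num_bins + 1)
    # The known range fixes the width of the fine-grained buckets up front.
    width = (max_val - min_val) / resolution
    counts = [0] * resolution
    for v in values:
        i = min(int((v - min_val) / width), resolution - 1)
        counts[i] += 1
    # The known total gives the number of rows each output bin should hold.
    target = total_count / num_bins
    boundaries = [min_val]
    cumulative = 0
    for i, c in enumerate(counts):
        cumulative += c
        # Close a bin each time the running count crosses the next quantile.
        while len(boundaries) < num_bins and cumulative >= target * len(boundaries):
            boundaries.append(min_val + (i + 1) * width)
    boundaries.append(max_val)
    return boundaries
```

Without the range, the bucket width could not be chosen before scanning the data, and without
the total the quantile targets would not be known.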
                
> column level statistics
> -----------------------
>
>                 Key: HIVE-1362
>                 URL: https://issues.apache.org/jira/browse/HIVE-1362
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Statistics
>            Reporter: Ning Zhang
>            Assignee: Shreepadma Venugopalan
>         Attachments: HIVE-1362.1.patch.txt, HIVE-1362.2.patch.txt, HIVE-1362.3.patch.txt,
HIVE-1362.4.patch.txt, HIVE-1362-gen_thrift.1.patch.txt, HIVE-1362-gen_thrift.2.patch.txt,
HIVE-1362-gen_thrift.3.patch.txt, HIVE-1362-gen_thrift.4.patch.txt
>
>


