hadoop-common-dev mailing list archives

From "Ashish Thusoo (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4488) [Hive]: Add ability to compute statistics on hive tables
Date Wed, 22 Oct 2008 18:57:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641939#action_12641939 ]

Ashish Thusoo commented on HADOOP-4488:
---------------------------------------

Type of statistics:
The following types of statistics can be collected on Hive partitions:

For each partition of the table:
1. Number of Rows
2. Size of the partition
3. Average size of a row
4. Number of blocks
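
As a rough illustration of how the four partition-level numbers relate, here is a minimal Python sketch. It is not Hive code: the function name, the in-memory list of row sizes, and the fixed block size are all assumptions made for the example.

```python
import math


def partition_stats(row_sizes, block_size):
    """Compute partition-level statistics from per-row sizes in bytes.

    Hypothetical helper: `row_sizes` stands in for a scan of the partition,
    and `block_size` is an assumed fixed storage block size in bytes.
    """
    num_rows = len(row_sizes)
    total_size = sum(row_sizes)
    return {
        "num_rows": num_rows,
        "size_bytes": total_size,
        # Average row size is derived from the two counters above.
        "avg_row_size": total_size / num_rows if num_rows else 0.0,
        # Blocks needed to hold the partition, rounding up the last one.
        "num_blocks": math.ceil(total_size / block_size),
    }
```

Only the row count and the total size need to be accumulated during a scan; the average row size follows from them.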

For a column in the partition:
1. Number of distinct values
2. Number of null values
3. Minimum 3 values
4. Maximum 3 values
5. Histogram: a frequency histogram or a height-balanced histogram (the former has equal-range
bins, while the latter has the same height for all the bins)
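
The column-level statistics, and the difference between the two histogram types, can be sketched in Python as follows. This is an illustration under assumptions, not the planned implementation: the function names are hypothetical, nulls are modelled as None, the minimum and maximum 3 values are taken over distinct values, and the height-balanced histogram is represented by the upper boundary of each bin.

```python
def column_stats(values):
    """Distinct count, null count, and the 3 smallest/largest distinct values."""
    non_null = [v for v in values if v is not None]
    distinct = sorted(set(non_null))
    return {
        "num_distinct": len(distinct),
        "num_nulls": len(values) - len(non_null),
        "min_3": distinct[:3],
        "max_3": distinct[-3:][::-1],  # largest first
    }


def frequency_histogram(values, num_bins):
    """Equal-range bins: every bin spans the same width, counts vary."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1  # avoid zero width when all values are equal
    counts = [0] * num_bins
    for v in values:
        # The maximum value falls on the last boundary, so clamp it into the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


def height_balanced_histogram(values, num_bins):
    """Equal-height bins: return the upper boundary value of each bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[max(0, ((i + 1) * n) // num_bins - 1)] for i in range(num_bins)]
```

On skewed data such as 1 through 8 plus an outlier of 100, the equal-range histogram puts almost everything in the first bin, while the height-balanced boundaries still split the rows evenly.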

The column-level statistics could also be calculated for distributions in an average block.

Language Elements:
ANALYZE TABLE <t> PARTITION(<partitionspec>) COMPUTE STATISTICS
- this computes the partition-level statistics.

ANALYZE TABLE <t> PARTITION(<partitionspec>) COMPUTE STATISTICS FOR ALL COLUMNS SIZE n
- this computes the column-level statistics for all columns, with n being the number of bins
in the histogram.

ANALYZE TABLE <t> PARTITION(<partitionspec>) COMPUTE STATISTICS FOR COLUMNS SIZE m c1 SIZE n1, c2 SIZE n2, c3
- this computes the column-level statistics for columns c1 (using n1 bins for the histogram),
c2 (using n2 bins) and c3 (using the default m bins).
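
To make the per-column bin defaults concrete, here is a small Python sketch that resolves the column list of the third form into a bin count per column. The function name and the whitespace/comma tokenization are assumptions for illustration only; the real grammar would live in the Hive parser.

```python
def resolve_bins(spec):
    """Map 'SIZE m c1 SIZE n1, c2 SIZE n2, c3' to {column: number_of_bins}.

    Hypothetical helper: columns without their own SIZE fall back to m.
    """
    tokens = spec.split()
    if len(tokens) < 3 or tokens[0].upper() != "SIZE":
        raise ValueError("expected a leading 'SIZE m' default")
    default_bins = int(tokens[1])
    bins = {}
    for item in " ".join(tokens[2:]).split(","):
        parts = item.split()
        if len(parts) == 1:
            # Bare column name: use the default m bins.
            bins[parts[0]] = default_bins
        elif len(parts) == 3 and parts[1].upper() == "SIZE":
            # Column with an explicit bin count.
            bins[parts[0]] = int(parts[2])
        else:
            raise ValueError(f"cannot parse column item: {item!r}")
    return bins
```

For example, `resolve_bins("SIZE 10 c1 SIZE 20, c2 SIZE 5, c3")` gives c1 20 bins, c2 5 bins, and c3 the default 10 bins.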

We can later extend these commands so that they work on samples and extrapolate the results
to the entire data set. For that we could use ESTIMATE STATISTICS SAMPLE n ROWS or
ESTIMATE STATISTICS SAMPLE n%

e.g.

ANALYZE TABLE <t> PARTITION(<partitionspec>) ESTIMATE STATISTICS SAMPLE 10%
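
One simple way to extrapolate from a percentage sample is to scale additive counts by the inverse of the sampling fraction. The Python sketch below assumes a uniform random sample and uses a hypothetical function name; it is only meant to show the arithmetic, not how the feature would be implemented.

```python
def extrapolate_counts(sample_counts, sample_percent):
    """Scale additive counts measured on a sample up to the full data set.

    Assumes a uniform random sample covering `sample_percent` of the rows.
    """
    if not 0 < sample_percent <= 100:
        raise ValueError("sample_percent must be in (0, 100]")
    factor = 100 / sample_percent
    return {name: round(count * factor) for name, count in sample_counts.items()}
```

This kind of linear scaling suits additive quantities such as row counts and null counts. The number of distinct values does not scale linearly with sample size, so it would need a different estimator.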

More details on the actual implementation to follow...


> [Hive]: Add ability to compute statistics on hive tables
> --------------------------------------------------------
>
>                 Key: HADOOP-4488
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4488
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/hive
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>
> Add commands to collect partition and column level statistics in hive.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

