hadoop-common-dev mailing list archives

From "Prasad Chakka (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4488) [Hive]: Add ability to compute statistics on hive tables
Date Wed, 22 Oct 2008 19:11:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641947#action_12641947 ]

Prasad Chakka commented on HADOOP-4488:
---------------------------------------

Some comments and questions:

1- For each partition (or for the table, if it is not partitioned), we should also store the
number of files (so we can optimize the number of mappers).

2- We should make the number of bins optional and fall back to a default. We might need some
trial and error to figure out the optimal number, depending on the number of distinct values/row count.

3- How do you compute distinct values for floats? By rounding them, or by not storing them at all?

4- For strings, could we store stats for some prefix of the string?

5- In histograms, we should also store the number of distinct values in each bucket.

6- Can we store the correlation between two columns? It would help in estimating selectivity
more accurately.
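
To make points 2 and 5 concrete, here is a rough sketch (not Hive code) of an equi-width
histogram with an optional bin count, where each bucket keeps a distinct-value count next to
its row count. The names Bucket, DEFAULT_BINS, build_histogram and equality_selectivity are
made up for illustration, and the default of 10 bins is an arbitrary assumption. The
selectivity estimate assumes rows are spread evenly over the distinct values in a bucket.

```python
from collections import namedtuple

# Hypothetical bucket layout: value range, row count, and distinct-value count.
Bucket = namedtuple("Bucket", "low high rows distinct")

DEFAULT_BINS = 10  # assumed default; the right value would need trial and error


def build_histogram(values, num_bins=DEFAULT_BINS):
    """Build an equi-width histogram over numeric values."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    # Fall back to width 1 when all values are equal, to avoid dividing by zero.
    width = (hi - lo) / num_bins or 1
    rows = [0] * num_bins
    seen = [set() for _ in range(num_bins)]
    for v in values:
        # The maximum value lands in the last bucket instead of overflowing.
        i = min(int((v - lo) / width), num_bins - 1)
        rows[i] += 1
        seen[i].add(v)
    return [
        Bucket(lo + i * width, lo + (i + 1) * width, rows[i], len(seen[i]))
        for i in range(num_bins)
    ]


def equality_selectivity(hist, value):
    """Estimate the fraction of rows matching `col = value`."""
    total = sum(b.rows for b in hist)
    for i, b in enumerate(hist):
        is_last = i == len(hist) - 1
        if b.low <= value < b.high or (is_last and value == b.high):
            if b.distinct == 0:
                return 0.0
            # Uniformity assumption: each distinct value has rows/distinct rows.
            return (b.rows / b.distinct) / total
    return 0.0


hist = build_histogram([1, 1, 1, 2, 5, 9, 9, 10], num_bins=3)
print(equality_selectivity(hist, 1))
```

Without the per-bucket distinct count, the estimator would have to guess how many values
share a bucket, which is why storing it in the histogram helps.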



> [Hive]: Add ability to compute statistics on hive tables
> --------------------------------------------------------
>
>                 Key: HADOOP-4488
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4488
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/hive
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>
> Add commands to collect partition and column level statistics in hive.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

