hive-dev mailing list archives

From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-3917) Support fast operation for analyze command
Date Tue, 29 Jan 2013 17:09:14 GMT

    [ https://issues.apache.org/jira/browse/HIVE-3917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565530#comment-13565530 ]

Ashutosh Chauhan edited comment on HIVE-3917 at 1/29/13 5:09 PM:
-----------------------------------------------------------------

I think Shreepadma's initial question about the possibility of gathering other stats (like # of
rows, top-K, min, max etc.) needs more thought. E.g., some file formats store these stats
themselves (like ORC ( HIVE-3874 ), HBase's HFile etc.) and can return some of them
just by reading metadata contained in the file headers. We need not implement
this right away, but we should keep it in mind while designing this. That's why I didn't
like {{noscan}} very much, since in these cases you can collect stats very fast with partial
scans (i.e., just by asking the FileFormat, which will read metadata from the file, instead of
a full scan of the file).
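
To make the partial-scan idea concrete, here is a minimal sketch in Python. It assumes a made-up toy layout (a fixed header holding row count, min and max, followed by the rows); this is not the actual ORC or HFile layout, and the names write_table, stats_from_header and stats_from_full_scan are hypothetical. It only illustrates that a format which keeps its own stats can answer from metadata without touching the rows.

```python
import io
import struct

# Hypothetical toy layout (NOT the real ORC/HFile format):
# header = row count, min, max as three little-endian signed 64-bit ints,
# followed by the rows, each a little-endian signed 64-bit int.
HEADER = struct.Struct("<qqq")
ROW = struct.Struct("<q")


def write_table(stream, values):
    """Write the header with precomputed stats, then the rows (values must be non-empty)."""
    stream.write(HEADER.pack(len(values), min(values), max(values)))
    for v in values:
        stream.write(ROW.pack(v))


def stats_from_header(stream):
    """Partial scan: read only the fixed-size header, never the rows."""
    stream.seek(0)
    num_rows, lo, hi = HEADER.unpack(stream.read(HEADER.size))
    return {"num_rows": num_rows, "min": lo, "max": hi}


def stats_from_full_scan(stream):
    """Full scan: skip the header and recompute the stats from every row."""
    stream.seek(HEADER.size)
    values = [v for (v,) in ROW.iter_unpack(stream.read())]
    return {"num_rows": len(values), "min": min(values), "max": max(values)}


buf = io.BytesIO()
write_table(buf, [5, -2, 9])
print(stats_from_header(buf))
print(stats_from_full_scan(buf))
```

Both functions return the same stats, but the header read costs a fixed number of bytes regardless of table size, which is the case a plain {{noscan}} option would not cover.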
                
      was (Author: ashutoshc):
    I think Shreepadma's initial question about possibility of gathering other stats (like
# of rows, top-K, min, max etc.) needs more thought. E.g., some fileformats contain these
stats with themselves (like ORC ( HIVE-3784 ), HFile of Hbase etc) and can return some of
these stats just by reading metadata contained in the file headers. Though we need not to
implement them right away but we should keep that in mind while designing this. That's why
I didn't like {{noscan}} every much since in these cases you can collect stats very fast with
partial scans (i.e., just by asking FileFormat which will read metadata from file, instead
of full scan of file). 
                  
> Support fast operation for analyze command
> ------------------------------------------
>
>                 Key: HIVE-3917
>                 URL: https://issues.apache.org/jira/browse/HIVE-3917
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>    Affects Versions: 0.11.0
>            Reporter: Gang Tim Liu
>            Assignee: Gang Tim Liu
>         Attachments: HIVE-3917.patch.1
>
>
> Hive supports the analyze command to gather statistics from existing tables/partitions: https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ExistingTables
> It collects:
> 1. Number of Rows
> 2. Number of files
> 3. Size in Bytes
> If the table/partition is big, the operation takes a long time since it opens all files and scans all data.
> It would be nice to support a fast operation that gathers the statistics which don't require opening all files:
> 1. Number of files
> 2. Size in Bytes
> Potential syntax is 
> ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [noscan];
> In the future, all statistics without scan can be retrieved via this optional parameter.
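
The two no-scan statistics in the description (number of files, size in bytes) can be sketched as follows. This is a local-filesystem analogy in Python, not Hive's implementation: Hive would go through its own file-system API, and the function name noscan_stats is hypothetical. The point is only that both numbers come from file-system metadata, so no file is ever opened.

```python
import os


def noscan_stats(partition_dir):
    """Return (num_files, total_bytes) for a directory tree using metadata only."""
    num_files = 0
    total_bytes = 0
    for root, _dirs, files in os.walk(partition_dir):
        for name in files:
            # os.stat reads file-system metadata; the file contents are not read.
            total_bytes += os.stat(os.path.join(root, name)).st_size
            num_files += 1
    return num_files, total_bytes
```

Number of rows is absent on purpose: without help from the file format it still needs a scan of the data.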

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
