hive-dev mailing list archives

From "Prasanth J (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-5369) Annotate hive operator tree with statistics from metastore
Date Thu, 14 Nov 2013 03:41:21 GMT

     [ https://issues.apache.org/jira/browse/HIVE-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated HIVE-5369:
-----------------------------

    Attachment: HIVE-5369.6.patch.txt

[~rhbutani] Thanks for reviewing the patch. I addressed all your review comments in this patch
and left some comments in RB too. Apart from the code review comments, I made the following changes:

1) The JOIN rule was missing its precondition check. Because of this, the rule was applied even
when the parents' column statistics were not available, which resulted in unexpected exceptions.
This is the reason for many of the test case failures in the previous HIVE QA precommit run.
2) The GROUPBY rule multiplies the number of rows when grouping sets are used. When the statistics
were updated, the column statistics were also multiplied by that factor. The number of distinct
values should not be affected by this multiplier, so I added code to keep the number of distincts
unchanged when the number of rows increases.
3) The FILTER rule, when the predicate is a boolean column or NOT of a boolean column, now
returns numTrues or numFalses respectively. This gives a more accurate number of rows than
dividing by 2.
4) Added a qfile test to check the effect of map-side parallelism on the group by operator.
5) Removed all PTF related code.
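
Items 2 and 3 above can be sketched roughly as follows. This is only a minimal Python
illustration of the two rules as described in this message, not the code in the patch. The
names ColStats, filter_boolean_rows, and scale_for_grouping_sets are hypothetical, and the
sketch assumes that column statistics carry numTrues, numFalses, and a distinct-value count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColStats:
    """Hypothetical per-column statistics; field names are illustrative only."""
    num_distincts: int = 0
    num_trues: int = 0
    num_falses: int = 0


def filter_boolean_rows(stats: ColStats, negated: bool = False) -> int:
    """Estimate rows passing `WHERE col` (or `WHERE NOT col`) for a boolean column.

    Uses the recorded true/false counts instead of halving the row count.
    """
    return stats.num_falses if negated else stats.num_trues


def scale_for_grouping_sets(num_rows: int, stats: ColStats, multiplier: int):
    """Scale the row count by the grouping-set multiplier.

    The column statistics are returned as they came in, so the number of
    distincts does not grow along with the number of rows.
    """
    return num_rows * multiplier, stats
```

For example, with num_trues=30 and num_falses=70, a predicate on the bare column is estimated
at 30 rows and its negation at 70, whichever way the total row count is later scaled.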

The other failing test cases are still not fixed. I will look for the HIVE QA failure report
to regenerate the failing tests.

> Annotate hive operator tree with statistics from metastore
> ----------------------------------------------------------
>
>                 Key: HIVE-5369
>                 URL: https://issues.apache.org/jira/browse/HIVE-5369
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor, Statistics
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: statistics
>             Fix For: 0.13.0
>
>         Attachments: HIVE-5369.1.txt, HIVE-5369.2.WIP.txt, HIVE-5369.2.patch.txt, HIVE-5369.3.patch.txt,
>                      HIVE-5369.4.patch.txt, HIVE-5369.5.patch.txt, HIVE-5369.6.patch.txt, HIVE-5369.WIP.txt,
>                      HIVE-5369.refactor.WIP.txt
>
>
> Currently the statistics gathered at table/partition level and column level are not used
> during the query planning stage. Statistics at table/partition and column level can be used
> for optimizing the query plans. Basic statistics like uncompressed data size can be used for
> better reducer estimation. Other statistics like number of rows, distinct values of columns,
> average length of columns etc. can be used by the Cost Based Optimizer (CBO) for making better
> query plan selection. As a first step in improving query planning, the statistics that are
> available in the metastore should be attached to the hive operator tree. The operator tree
> should be walked and annotated with statistics information. The attached statistics will vary
> for each operator depending on the operation it performs. For example, the select operator
> will change the average row size but doesn't affect the number of rows. Similarly, the filter
> operator will change the number of rows but doesn't change the average row size. Similar rules
> can be applied for other operators as well.
> Rules for different operators are added as comments in the code. For more detailed information,
> the reference book that I am using is "Database Systems: The Complete Book" by Garcia-Molina
> et al.
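
The select and filter rules mentioned in the description can be illustrated with a small
sketch. This is a hedged Python example under stated assumptions, not Hive's implementation:
the Stats class, annotate_select, and annotate_filter are hypothetical names, and the filter
selectivity is passed in as a plain number rather than derived from a predicate.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Stats:
    """Hypothetical statistics attached to one operator in the tree."""
    num_rows: int
    avg_row_size: float

    @property
    def data_size(self) -> float:
        # Estimated data size is the row count times the average row size.
        return self.num_rows * self.avg_row_size


def annotate_select(parent: Stats, projected_col_sizes: Sequence[float]) -> Stats:
    """Select: the row count is unchanged, the average row size becomes
    the sum of the average sizes of the projected columns."""
    return Stats(parent.num_rows, sum(projected_col_sizes))


def annotate_filter(parent: Stats, selectivity: float) -> Stats:
    """Filter: the row count is scaled by the (assumed) selectivity,
    the average row size is unchanged."""
    return Stats(round(parent.num_rows * selectivity), parent.avg_row_size)
```

Walking a scan -> select -> filter chain then amounts to calling these functions in order,
each taking the statistics of its parent operator as input.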



--
This message was sent by Atlassian JIRA
(v6.1#6144)
