[ https://issues.apache.org/jira/browse/HIVE-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=12989084#comment12989084
]
bharath v commented on HIVE-1938:

I have a few basic ideas about join costs. Consider the case of the Common Join (we can extend
this to other joins), for which I propose a mathematical approach.
We can consider disk I/O and communication cost as the major bottlenecks and optimize the join
based on these factors. The challenge here is in predicting the size of the intermediate tables'
output (when joining more than 2 tables).
Suppose there are "N" machines in the Hive/Hadoop cluster.
Suppose we are joining table "A" on column "a" with table "B" on column "b". Table A contains
n1 rows and table B has n2 rows. Suppose A.a has "d1" distinct values and B.b has "d2" distinct
values, and suppose d1 < d2 without loss of generality. The cost computation of a single join
can be divided into 3 phases:
1) Map phase:
Both tables are scanned from disk once, and we can assume this is done in parallel on the N
machines. So the total disk I/O cost is estimated as O((n1+n2)/N) (assuming a constant page
size).
2) Shuffle phase:
The cost estimation during this phase depends on the type of join. In the worst case every
row gets transferred to a reducer on a different machine (other than the one on which it resides).
This results in the movement of every row to a different machine. Assuming some constant ping time
(obtained by averaging a few measurements), the cost is O(k*(n1+n2)), where k is the
cost per unit of data transferred. Actual shuffle costs may be less than this due to the "maximizing
locality" effect in Hadoop MR. I am just providing an upper bound, and we can surely improve
upon this depending on the type of join.
For example, for a map-side join we can instead add the cost of moving the smaller table to all
the mappers, which is O(k*N*n3), where n3 is the number of rows in the smaller table.
3) Reduce phase:
In each reducer we just do a simple nested-loop join. Assuming a uniform distribution of data,
we get (n1/d1) rows of "A" and (n2/d2) rows of "B" per reducer call, so the cost of one call is
O((n1/d1)*(n2/d2)), and at most d1/N key groups are handled sequentially per machine. Multiplying
by that factor, the total cost is O((n1*n2)/(N*d2)).
The above cost is the result of assuming a uniform distribution of data per table. We can improve
this by maintaining histograms of statistics per node, with which we can refine this calculation.
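To make the formulae concrete, here is a rough Java sketch of the three phase costs above,
including the map-side join variant. It is purely illustrative; the class, method names and
sample numbers are made up and are not Hive code.

// Illustrative sketch of the per-phase cost formulae above; not part of Hive.
// n1, n2 = row counts, d1, d2 = distinct join-key counts, numMachines = N, k = cost per unit transferred.
public class JoinCostSketch {

    // Map phase: both tables scanned once, in parallel on N machines.
    static double mapCost(long n1, long n2, int numMachines) {
        return (double) (n1 + n2) / numMachines;
    }

    // Shuffle phase upper bound: every row moves to a reducer on another machine.
    static double shuffleCost(long n1, long n2, double k) {
        return k * (n1 + n2);
    }

    // Map-side join variant: replicate the smaller table (n3 rows) to all N mappers instead.
    static double mapSideReplicationCost(long n3, int numMachines, double k) {
        return k * numMachines * n3;
    }

    // Reduce phase: nested-loop join per key group, with d1/N groups handled
    // sequentially per machine and (n1/d1)*(n2/d2) work per group (uniform data).
    static double reduceCost(long n1, long n2, long d1, long d2, int numMachines) {
        double perGroup = ((double) n1 / d1) * ((double) n2 / d2);
        return perGroup * ((double) d1 / numMachines);   // = n1*n2 / (N*d2)
    }

    public static void main(String[] args) {
        long n1 = 10_000_000L, n2 = 1_000_000L, d1 = 1_000L, d2 = 10_000L;
        int numMachines = 20;
        double k = 5.0;   // assumed cost per unit of data shuffled
        double total = mapCost(n1, n2, numMachines)
                + shuffleCost(n1, n2, k)
                + reduceCost(n1, n2, d1, d2, numMachines);
        System.out.println("Estimated common-join cost: " + total);
    }
}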
After calculating the join cost, we need to predict the size of the resultant join table so that
its statistics can be used to continue the above procedure for the rest of the tables.
Estimating the result table size:
The resultant table will have O(n1*n2/d1) rows, assuming that the distribution of distinct values
of A.a and B.b is the same. This is where we can improve a great deal by maintaining good
statistics, so that we can predict the correct size of the resultant table.
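As a small sketch of that estimate (illustrative names only, not Hive code):

// Illustrative only: expected result cardinality under the same-value-set assumption above.
public class ResultSizeSketch {
    static long estimateResultRows(long n1, long n2, long d1, long d2) {
        // Each of the d1 (== d2) keys matches ~(n1/d1) rows of A with ~(n2/d2) rows of B,
        // giving d1 * (n1/d1) * (n2/d2) = n1*n2/d1 rows in total.
        return (n1 * n2) / Math.max(d1, d2);
    }

    public static void main(String[] args) {
        System.out.println(estimateResultRows(10_000_000L, 1_000_000L, 10_000L, 10_000L));
    }
}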
We can then follow a simple System R-style dynamic programming approach, where we choose the best
plans in each iteration using the above cost formulae.
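For illustration only, here is how such an enumeration over the cost formulae above could be
structured. This is a sketch, not a proposed implementation: it assumes a single join column's
distinct-value count per table, assumes every pair of tables can be joined, and considers only
left-deep orders.

// Rough sketch of System R-style dynamic programming over left-deep join orders,
// using the cost and cardinality estimates sketched above. Not Hive code.
import java.util.*;

public class JoinOrderDP {

    // Per-table statistics: row count and distinct values of its join column (illustrative).
    record TableStats(long rows, long distinct) {}

    // Best known plan for a subset of tables: estimated output rows and accumulated cost.
    record Plan(long rows, double cost) {}

    static Plan bestPlan(List<TableStats> tables, int machines, double k) {
        int n = tables.size();
        Map<Integer, Plan> best = new HashMap<>();
        // Base case: each single table, scanned once across the cluster.
        for (int i = 0; i < n; i++) {
            best.put(1 << i, new Plan(tables.get(i).rows(), (double) tables.get(i).rows() / machines));
        }
        // Grow subsets (bitmasks), keeping the cheapest plan per subset.
        for (int mask = 1; mask < (1 << n); mask++) {
            if (Integer.bitCount(mask) < 2) continue;
            Plan bestSoFar = null;
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) == 0) continue;
                Plan left = best.get(mask ^ (1 << i));   // best plan for the subset without table i
                TableStats right = tables.get(i);
                long d2 = Math.max(right.distinct(), 1);
                // Common-join cost of (left subset) JOIN table i, using the formulae above.
                double stepCost = (double) (left.rows() + right.rows()) / machines    // map
                        + k * (left.rows() + right.rows())                            // shuffle upper bound
                        + (double) left.rows() * right.rows() / (machines * d2);      // reduce
                long outRows = left.rows() * right.rows() / d2;                       // size estimate
                Plan candidate = new Plan(outRows, left.cost() + stepCost);
                if (bestSoFar == null || candidate.cost() < bestSoFar.cost()) bestSoFar = candidate;
            }
            best.put(mask, bestSoFar);
        }
        return best.get((1 << n) - 1);
    }

    public static void main(String[] args) {
        List<TableStats> tables = List.of(
                new TableStats(10_000_000L, 10_000L),
                new TableStats(1_000_000L, 10_000L),
                new TableStats(50_000L, 5_000L));
        System.out.println(bestPlan(tables, 20, 5.0));
    }
}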
This is a pretty long post already. This is the basic approach, and we can improve upon it
further depending on the feedback I get.
I presented this work as a poster at ACM SIGMOD 2010 and got some positive feedback.
> Cost Based Query optimization in Hive
> 
>
> Key: HIVE-1938
> URL: https://issues.apache.org/jira/browse/HIVE-1938
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Environment: *nix,java
> Reporter: bharath v
>
> Current optimization in Hive is just rule-based and involves applying a set of rules
> on the plan tree. This depends on hints given by the user (which may or may not be correct)
> and might result in execution of costlier plans. So this JIRA aims at building a cost model
> which can give a good estimate of various plans beforehand (using some metadata already collected),
> and we can choose the plan which incurs the least cost.

This message is automatically generated by JIRA.

For more information on JIRA, see: http://www.atlassian.com/software/jira
