db-derby-dev mailing list archives

From Army <qoz...@gmail.com>
Subject Re: [ optimizer ] Some questions on optimization cost estimates?
Date Tue, 04 Apr 2006 00:19:33 GMT
Army wrote:
>> So my question is this: if two cost estimates have the same 
>> estimatedCost but different rowCounts (or even singleScanRowCounts), 
>> is it reasonable to suppose that the estimate with the lower rowCount 
>> is actually better?

Mike Matrigali wrote:
> I would think it would be better to figure out why the costs are the 
> same for these vastly different plans.  Seems like a problem with 
> costing.  It seems cleaner to me to compare based on one field.  On the
> face of this there is some bad assumption in building the cost - maybe
> it is the 0 cost per hash look up you also mentioned?  What were the 
> actual 2 plans?

Thanks for the input, Mike.  In looking more at the query plans to answer your 
questions, I think I ended up stumbling into an explanation of why I'm seeing 
this behavior.

I looked some more at the plans, and the only difference between them is the 
join order.  For context, here are the queries again:

select * from V1, V2 where V1.j = V2.b;
select * from V2, V1 where V1.j = V2.b;

where V1 and V2 are both unions of two tables, with V2 having 100,000 rows and 
V1 having 7 (seven) rows.

In both cases the optimizer chooses to do a Hash join between V1 and V2.  In the 
first case V1 is the inner table; in the second, V2 is the inner table.  That said, 
the cost calculation for a JoinNode, as found in JoinNode.optimizeIt(), is:

	/*
	** We add the costs for the inner and outer table, but the number
	** of rows is that for the inner table only.
	*/
	costEstimate.setCost(
		leftResultSet.getCostEstimate().getEstimatedCost() +
		rightResultSet.getCostEstimate().getEstimatedCost(),
		rightResultSet.getCostEstimate().rowCount(),
		rightResultSet.getCostEstimate().singleScanRowCount());

Thus even though the cost will be the same regardless of which "table" comes 
first, the rowCount will always be that of the inner table.  In this case, the 
optimizer guesses "20" for the row count of V1 and "100,016" for the row count 
of V2.

So that explains why we have the same cost but different row counts.
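
To make the arithmetic concrete, here's a toy standalone sketch of the 
calculation above.  This is *not* Derby code: the class name and the per-side 
costs are made up for illustration, and the row counts are just the 
optimizer's guesses quoted above (20 for V1, 100,016 for V2).

	// Toy model of the JoinNode cost calculation -- NOT Derby code.
	// Per-side costs are hypothetical; row counts are the optimizer's
	// guesses for V1 and V2 from above.
	public class JoinCostSketch {

		// cost = outer cost + inner cost; rowCount = inner's rowCount only.
		static void report(String label, double outerCost,
				double innerCost, double innerRows) {
			System.out.println(label + ": cost=" + (outerCost + innerCost)
					+ ", rowCount=" + innerRows);
		}

		public static void main(String[] args) {
			double v1Cost = 50.0, v2Cost = 5000.0;   // hypothetical
			double v1Rows = 20.0, v2Rows = 100016.0; // optimizer's guesses

			// Same summed cost either way, but the reported row count
			// is whichever table happens to be the inner one.
			report("V1 inner", v2Cost, v1Cost, v1Rows);
			report("V2 inner", v1Cost, v2Cost, v2Rows);
		}
	}

Running that prints the same cost for both orders but row counts of 20.0 vs. 
100016.0, which is exactly the shape of the estimates I'm seeing.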

As to why the performance is so much worse when V2 is the inner table, I think 
it's because we try to materialize all 100,000 rows into a Hash table, which (I 
would guess?) spills over to disk and causes us to use a backing store hash 
table.  But use of the backing store hash table isn't yet factored into 
optimizer cost estimates.  Instead, the optimizer does a check to see if the 
memory required for the hash table is "excessive" (i.e., probably too large to 
fit in memory, and thus likely to require a backing store hash) and, if so, it 
skips the plan (assuming there is another feasible plan that does not require 
so much memory).
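
As a rough sketch of what that check amounts to (hypothetical code, not the 
actual optimizer check; the method name, the per-row size, and the 1 MB budget 
are all made up for illustration), the idea is just to compare the hash 
table's estimated footprint against a per-table memory budget:

	// Hypothetical "excessive memory" check -- NOT the actual Derby code.
	public class HashMemorySketch {

		// True if building the hash table would likely exceed the budget.
		static boolean isMemoryExcessive(double innerRowCount,
				double bytesPerRow, long maxMemoryBytes) {
			return innerRowCount * bytesPerRow > maxMemoryBytes;
		}

		public static void main(String[] args) {
			long budget = 1024 * 1024; // assume a 1 MB per-table budget

			// 20 rows at ~100 bytes each: fine, keep this join order.
			System.out.println(isMemoryExcessive(20, 100, budget));     // false

			// 100,016 rows at ~100 bytes each: ~10 MB, so skip this
			// join order (assuming some other feasible plan exists).
			System.out.println(isMemoryExcessive(100016, 100, budget)); // true
		}
	}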

That said, when we specify "V1, V2" in the query, the check for excessive memory 
occurs as it should.  But when we specify "V2, V1" the check for excessive 
memory returns incorrect results; i.e., it doesn't realize that we're going to be 
using so much memory, and so we do not skip the join order like we should.  As 
it turns out, this incorrect behavior is caused by a bug in the 10.1 code that 
has been fixed in 10.2 as Phase 1 of DERBY-805.  So in short, I was using 10.1 
to try to analyze the behavior *without* my DERBY-805 changes, and it turns out 
that the incorrect behavior I'm seeing is actually fixed as part of the 
DERBY-805 Phase 1 patch.  Oops.  And just to make sure, I did verify this with 
the 10.2 codeline (after disabling the predicate pushdown for unions) and the 
optimizer does indeed choose the same join order regardless of whether I give 
"V1, V2" or "V2, V1".

 > It seems cleaner to me to compare based on one field.

Okay, I'm fine with leaving the compare as a one-field check for now.  I do 
however think this means that we can still end up with the optimizer choosing 
different plans depending on the order in which the user specifies the FROM 
list elements.  In cases where one join order requires excessive memory we will 
(as of Phase 1 for DERBY-805) correctly skip that join order regardless of how 
the user specified the FROM list.  But in cases where memory is not an issue, I 
think the order of the FROM list in the user's query can still affect the plan 
chosen by the optimizer.  My gut instinct is that this seems odd, but I guess I 
need to look at it some more...
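
For what it's worth, here's a purely hypothetical illustration (not Derby's 
actual plan-selection code) of why a one-field compare leaves the outcome 
order-dependent: if the comparison looks only at estimatedCost and a tie keeps 
the incumbent, then whichever join order gets considered first wins.

	// Hypothetical tie-break illustration -- NOT Derby's plan selection.
	public class TieBreakSketch {

		static String pickBest(String[] plans, double[] costs) {
			String best = plans[0];
			double bestCost = costs[0];
			for (int i = 1; i < plans.length; i++) {
				// Strictly less-than: a tie keeps the plan seen first.
				if (costs[i] < bestCost) {
					best = plans[i];
					bestCost = costs[i];
				}
			}
			return best;
		}

		public static void main(String[] args) {
			double cost = 5050.0; // both join orders cost the same

			// The FROM-list order decides which plan is considered
			// first, and therefore which one survives the tie.
			System.out.println(pickBest(
				new String[] { "V1 join V2", "V2 join V1" },
				new double[] { cost, cost })); // -> V1 join V2
			System.out.println(pickBest(
				new String[] { "V2 join V1", "V1 join V2" },
				new double[] { cost, cost })); // -> V2 join V1
		}
	}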

Thanks to Mike for prodding me in the right direction so that I could understand 
this more.

And if anyone has input for the other 2 questions I asked on this thread, I'm 
all ears... :)

Army

