db-derby-dev mailing list archives

From Mike Matrigali <mikem_...@sbcglobal.net>
Subject Re: question about istats thresholds
Date Wed, 16 Feb 2011 18:17:53 GMT
Kristian Waagan wrote:
> On 15.02.11 23:15, Mike Matrigali wrote:
>> Thanks for taking a look at this.  As you point out, the row
>> estimates are particularly hard to count on for a very small number
>> of rows.  After the table gets big, the other params seem OK until we
>> get more feedback.  I wonder if there should just be some minimum
>> table size rather than "diff" sizes.  I would lean toward defaulting
>> to not running stats on a table that already has stats unless it is
>> over some minimum size - say 1000 rows.
> 
> First, when you refer to table size, do you mean the table row estimate?
> 
Yes, I mean the current row estimate that the table itself has.
>>
>> If we make this change then the expected behavior would be:
>> o no scheduled istat runs for tables with no indexes
> 
> As it is today.
> 
>> o no scheduled istat run for a table with indexes and under 100 rows.
> 
> As it is today - this controls creation of new statistics.
> If we introduce a lower limit of 1000 rows for updating statistics, 
> would you consider increasing the creation limit to 1000 rows as well?
From the updatelocks test, it looks like istats does run today for
tables with fewer than 100 rows.

I would be OK with changing the behavior to default to create and update
only if the table has more than 1000 rows.  What I am thinking for
defaults in the first release is that we should err on the side of
updating less often and see how that works in the field, so we are less
likely to adversely affect applications.  What I most want to avoid is a
case where the background stats task is consistently running for some
weird application mix.  The most likely case I see currently where this
might happen is with small tables.  We can increase how often stats are
updated later if we see evidence that it would be better.
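
To make this concrete, here is a rough sketch of the gating I have in
mind.  All names and constants here are illustrative only - this is not
the actual daemon code, and the 1000-row values are just the proposed
defaults from this thread:

// Illustrative sketch of the proposed istat scheduling gate.
// Names and constants are made up; not actual Derby daemon code.
final class IstatGate {
    static final long MIN_ROWS_CREATE = 1000; // proposed creation floor
    static final long MIN_ROWS_UPDATE = 1000; // proposed update floor

    /** Should the daemon consider a stats run for this table? */
    static boolean shouldSchedule(long rowEstimate, boolean hasIndexes,
                                  boolean hasStats) {
        if (!hasIndexes) {
            return false;                          // no indexes, no istats
        }
        if (!hasStats) {
            return rowEstimate >= MIN_ROWS_CREATE; // creating new stats
        }
        // Updating existing stats; the logarithmic change check discussed
        // below would still apply on top of this size floor.
        return rowEstimate >= MIN_ROWS_UPDATE;
    }
}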
> 
>> o no scheduled istat runs for a table with indexes and existing stats 
>> with under 1000 rows.
> 
> This would be new behavior.
> 
>> o tables with over 1000 rows get stats based on the current logic.
> 
> As it is today, but extended with a cutoff value (i.e. lower limit of 
> 1000 rows).
> 
> 
> To stress the point, what we are now talking about concerns small 
> tables. Inaccuracies in the row estimate for larger tables are less 
> likely to trigger stat updates due to the logarithmic comparison.
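
Roughly, that comparison looks at the change in the row estimate on a
log scale, so a table has to grow or shrink by a large factor, not by a
fixed number of rows, before an update is considered.  A small
illustrative sketch (the method name and threshold value are made up,
not the shipped code):

// Illustrative: compare row counts on a log scale, so only large
// *relative* changes trigger a statistics refresh.
final class LogDiff {
    static boolean bigEnoughChange(long currentRows, long rowsAtLastStats,
                                   double lnThreshold) {
        double diff = Math.abs(Math.log((double) currentRows)
                               - Math.log((double) rowsAtLastStats));
        return diff > lnThreshold;
    }

    public static void main(String[] args) {
        // 1,000,000 -> 900,000 rows is only ~0.105 on the log scale...
        System.out.println(bigEnoughChange(900000, 1000000, 1.0)); // false
        // ...while 1,000,000 -> 100 rows is ~9.2, an enormous change.
        System.out.println(bigEnoughChange(100, 1000000, 1.0));    // true
    }
}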
> 
> One question at the end:
> If we introduce the 1000 rows lower limit for stats update, what will 
> happen with the stats on a table that shrinks from 1 million rows to 100 
> rows in one go?
As you point out, in this corner case the system will not automatically
update the statistics.  Note that the number of rows is not exact in
determining whether we should update statistics, just a guess.  The stat
that is being maintained is basically the average number of duplicates
for a given key (a little more complicated for multi-key indexes).  So
the stat for the million-row table may still be exactly right, or it may
not be.  Statistics are less important to Derby than to other databases,
since the optimizer uses the indexes themselves in many cases where
other databases count on static histograms.
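
As a toy illustration of why the old stat can happen to survive even a
big shrink - the maintained value is roughly rows divided by unique
keys, and the numbers here are invented for the example:

// Toy illustration, not Derby code: the per-index stat is roughly the
// average number of duplicates per key, i.e. rowCount / uniqueKeyCount.
final class CardinalityStat {
    static double avgDuplicates(long rowCount, long uniqueKeyCount) {
        return (double) rowCount / (double) uniqueKeyCount;
    }

    public static void main(String[] args) {
        // A million rows over 100,000 distinct keys: ~10 duplicates per key.
        System.out.println(avgDuplicates(1000000, 100000)); // 10.0
        // After shrinking to 100 rows over 10 distinct keys, the true value
        // is still 10.0, so the stale stat happens to remain accurate here.
        System.out.println(avgDuplicates(100, 10));         // 10.0
    }
}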

Another edge case that we don't handle is if the system changes the key
values over time, either using all updates or using inserts/deletes that
keep the total size of the table in a narrow range.  Again, the stat may
still be correct, or it may not.
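
In both of these corner cases an application can refresh the statistics
itself with the documented SYSCS_UTIL.SYSCS_UPDATE_STATISTICS procedure
(available since 10.5).  A minimal JDBC sketch; the database URL and
schema/table names are placeholders:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class RefreshStats {
    public static void main(String[] args) throws SQLException {
        Connection conn = DriverManager.getConnection("jdbc:derby:myDB");
        CallableStatement cs = conn.prepareCall(
            "CALL SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(?, ?, ?)");
        cs.setString(1, "APP");     // schema name (placeholder)
        cs.setString(2, "MYTABLE"); // table name (placeholder)
        cs.setNull(3, java.sql.Types.VARCHAR); // null = all indexes
        cs.execute();
        cs.close();
        conn.close();
    }
}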

In the future it might be interesting to see if there are better
indicators of the need to rerun stats than the number of rows,
especially for very large tables, where the cost of redoing the stats
is high but the cost of a bad plan is equally high.  Some other factors
that could be included might be:
o date of the last statistics
o a measure of key change since the last statistics
o analysis during execution of a query, recognizing when an estimate
was bad
o some lightweight statistics gathering where you take samples rather
than scanning the whole table, and compare those with the stored
estimates to see if you are out of date (see the sketch after this
list)
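
A rough sketch of that last idea - everything here is hypothetical, not
existing Derby code:

// Hypothetical sampling-based staleness check, not existing Derby code.
// Measure avg duplicates per key on a sample and compare it, on the same
// log scale as the row-count check, against the stored statistic.
final class SampleCheck {
    static boolean statsLookStale(double sampledAvgDup, double storedAvgDup,
                                  double tolerance) {
        double drift = Math.abs(Math.log(sampledAvgDup)
                                - Math.log(storedAvgDup));
        return drift > tolerance;
    }

    public static void main(String[] args) {
        // Stored stat says ~10 duplicates per key; a sample measures ~80.
        System.out.println(statsLookStale(80.0, 10.0, 1.0)); // true
        // A sample measuring ~12 is within tolerance; no rerun needed.
        System.out.println(statsLookStale(12.0, 10.0, 1.0)); // false
    }
}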


> This may be a corner case, but updating the stats once for a small
> table is cheap.  On the other hand, the suboptimal plan the optimizer
> may come up with may not be that much more expensive to execute for a
> small table with fewer than 1000 rows either?
> 
> 

