db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kristian Waagan <kristian.waa...@oracle.com>
Subject Re: Question on why indexStat deamon is being triggered
Date Mon, 02 Apr 2012 18:38:12 GMT
On 02.04.2012 19:13, Mike Matrigali wrote:
> Kristian Waagan wrote:
>> On 31.03.12 22:28, Mike Matrigali wrote:
>>> Bergquist, Brett wrote:
>>>> I hitched up the debugger and caught the indexStat return "49" for the
>>>> number of rows. I walked the stack back and found this was for
>>>> conglomeratename = "f3ec4922-011d-491f-3d8c-0000376d74d3"
< snip >
>>> First thought is I didn't think we needed any statistics at all
>>> for a single column primary key.
>>
>> That's probably DERBY-3790 ([1]).
>> I'll see if I can write a quick patch proposal and then we can take 
>> it from there.
> With daemon DERBY-3790 is probably more important than when people 
> were just
> manually running it.  I think a good incremental change is just first
> deal with single column unique indexes.  In that case I think we should
> make there be no row in sysindexes and verify this is not a problem for
> the optimizer because it understands unique index means that cardinality
> stat is redundant.  I think this should just be handled in update 
> statistics code and then the daemon would not have to have special logic
> about what index to go after.

Thanks, Mike.
I haven't written any code yet, but this looks like something that is 
easily achievable with minor changes to the update statistics code.

>
>
> I have not looked, but brett reported seeing a non-unique index being
> created for a foreign key that matched the primary key.  I think this
> is a very normal pattern for database design.  So may be worth looking
> at as a next incremental step.  I think it might have to be a 
> non-unique index as at that level of the system we might not know that 
> the key
> is unique and always will be.  But there might be room for an 
> optimization here.  Maybe update statistics could determine that the
> non-unique index is currently based on a unique index and not do
> the scanning work, and instead just immediately add a stat row with
> number of unique rows equal to current number of rows.  I think this
> is easier than teaching the optimizer about this.  Again I would just
> go after the single column case.  Once there is more than one column
> we are going to scan whole index once for all stats, so not worth
> worrying about unique index case.

This one requires more investigation (I don't know exactly what the 
index descriptors look like yet for these), but I observed that in 
Brett's case the same conglomerate was scanned twice (seen from the 
istat output).

>>
>>>
>>> There should be entries in this catalog for every index on a table, so
>>> it is not necessarily expected to be one per table. I think for each
>>> index there is an entry for each column in the index (actually the stat
>>> applies to multiple columns so in a 3 column index I think there is an
>>> entry for (col1), (col1,col2), and (col1, col2, col3). But looking at
>>> your ddl I would only expect 1 entry, as there is one
>>> index on the table, and it has one column. I am not sure what the
>>> expectation is for the foreign key that matches the primary key, maybe
>>> it is 2 rows?
>>>
>>> I do wonder if the foreign key is causing some extra issues. I think
>>> in the long past we would end up creating 2 indexes in this case and
>>> they would be exactly the same. So functionality was added to just
>>> "logically" create the index and it would rely on the underlying
>>> physical index sort of "shared" by the primary key on id and the 
>>> foreign
>>> constraint on id. We have had bugs in this area in the past and wonder
>>> if it left an "orphan" row in the statistics.
>>
>> As mentioned in another post, I think there's an easy way to avoid 
>> scanning the same index twice, but I need to make sure the check is 
>> valid first.
>> I'm thinking of simply checking if we've seen the index conglomerate 
>> number already, and if so, ignore that index. The question is, can 
>> two truly different indexes be represented by the same conglomerate 
>> number
>>
>>>
>>> Interesting that the second column it looks like it was created in 
>>> 2010.
>>>
>>> seems like there might be multiple problems here. It would be great to
>>> figure out how the multiple rows got in there in the first place. I
>>> would suggest seeing what the system does for ddl in a brand new
>>> database, turn off index stat thread, run your ddl, add some rows to 
>>> the
>>> table, and hand run update statistics procedure.
>>
>> Just for information, the istat daemon and the system procedure share 
>> almost all of the core code at the moment. The addition of automatic 
>> updates of index cardinality statistics consisted mainly of adding 
>> the "daemon framework" and triggering logic.
> great, i would like that any fixes in the area of this bad row work for
> both update stats triggered by the daemon and those triggered by direct
> call by user.
>>
>>>
>>> I would not be surprised if existing code does not expect the number of
>>> rows you are seeing. If this is really a buggy orphan row, probably 
>>> best
>>> zero admin fix is to change the update statistics code to look for
>>> other rows and delete them.
>>> you can follow the update statistic code starting at:
>>> java/engine/org/apache/derby/impl/sql/execute/AlterTableConstantAction/updateStatistics

>>>
>>>
>>>
>>> There may be a index stat daemon bug here too, but I think it might 
>>> just
>>> be getting confused by the bad data in the catalog. Maybe it just keeps
>>> reading all the rows and queueing work, but calling update statistics
>>> only update 2 of the 3 rows.
>>
>> Yes, the triggering code is driven by the existing statistics for the 
>> table, whereas the update code is driven by the list of indexes on 
>> the table.
>>
> Does the daemon when it determines an update is necessary, go through 
> the udpate stat path telling it to update stats on all indexes or does
> it call update stat one index at a time?

It goes through the path where all indexes are updated.
This can be changed in two locations if we wish to:
  1) In the scheduling call (currently it takes a TableDescriptor as an 
argument)
  2) Inside the daemon itself, where we could (re-)check the status of 
each index before processing it.

When a statement is compiled the code checks the index cardinality 
statistics for staleness if there are qualifying indexes for the query. 
Once a stale index is found for a base table, the table descriptor is 
stashed aside to be submitted to the daemon as a unit of work (and we 
stop checking the rest of the indexes of that table).

Currently, the only time  a unit of work can be scheduled is when a 
select statement using an index is compiled/recompiled.


-- 
Kristian

>>
>


Mime
View raw message