db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bergquist, Brett" <Br...@canoga.com>
Subject RE: Question on cardinality, statistics, and when things go stale
Date Thu, 18 Nov 2010 19:27:10 GMT
Thanks for the explanation!  So if I understand this correctly, if the indexes that I am querying
on are always unique (the computed value that is index is always unique), then as long as
the optimizer knows about these statistics (ie I have run "syscs_util.syscs_update_statistics()"
after the table has a few rows), and my query is always like "select * from table where indexval
>= 'some starting value' and indexval < 'some ending value'" then it should always use
the index?

Again, thanks.


-----Original Message-----
From: Mike Matrigali [mailto:mikem_app@sbcglobal.net] 
Sent: Thursday, November 18, 2010 1:51 PM
To: derby-dev@db.apache.org
Subject: Re: Question on cardinality, statistics, and when things go stale

derby uses 2 types of statistics, for this discussion I will call them
distribution and cardinality.

For distribution derby uses the index themselves at query compile time,
so this info never goes stale.  For distribution the optimizer might
need to know what percentage of the keys in an index are between value1 
and value2.

Cardinality information can go stale and depends on the actual data in
the table.  Cardinality is used by the optimizer when the query does not 
have actual values at compile time (ie. ? operators). Cardinality is 
basically a single number that represents
the average number of duplicates per data value in the table.
In the case of a non unique index on a single column a logically it
will maintain one number that represents the average number of 
duplicates for a.  The stat can never be wrong for a unique index on
a single column as it is always 1 value per data value.
For a 2 column unique index on (a, b) a cardinality count will be
maintained on just the a values as they can be duplicate and it is 
possible to use the index for just qualifying on a.  No cardinality
count is necesary for queries that will provide both a and b as again
the cardinality count for that is a, b.

For many applications once a significant number of rows in
the table have been inserted it is enough to get the statistics once
and as rows come and go it does not matter as the average does not
change.  But of course it is easy to define an app that can break this.

Your question also depends on the query.   Assuming your index is on (a, 
b) and your query is something like select * from x where a = ? and b = 
? then the system is unlikely to ever not pick the index.  But usually 
the choice is not as clear.

Bergquist, Brett wrote:
> I have read as much as I can about this subject and am a little 
> confused.  I understand the cardinality statistics are not computed if 
> the tables are empty when indexes are created and I can work around that. 
> Say I have a table with 15 million records in it and there is a main 
> index that I need to be used to query and I update the statistics and 
> verify that query optimizer is using that index.   This table will have 
> many rows inserted and many rows deleted every day.  Basically the index 
> is a function of a timestamp and a counter.  So older records are going 
> to be deleted and newer records are going to be inserted.
> My question is will the query optimizer revert back to doing a table 
> scan at some point instead of using the index.  Basically will the index 
> go stale? 
> A second question is does a call to 
> "syscs_util.syscs_update_statistics'  lock a table/index while it is 
> being performed?  This tables of concern where the indexes might go 
> stale never have a free second in which inserts and deletes are being done....
> Thanks for any information.
> Brett

View raw message