db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Matrigali <mikem_...@sbcglobal.net>
Subject Re: How to update cardinality statistics
Date Wed, 19 Oct 2005 18:36:11 GMT
The histogram information I am talking does not cause overhead at insert
time.  We use the actual index itself for the histogram.  The overhead
exists at compile time.

At compile time the optimizer will ask store a histogram question like
how many rows do you estimate exist such that id > 5 and id < 100.  To
answer this question store does a left search in the tree for 5 and
a search for 100 and then estimates from where those nodes lie in the
tree how many rows are in between.  This information along with row
counts is 90% of the stat used by Derby.

There is one other statistic that currently is only gathered when
an index is created either explicitly or part of a complete table
reorg.  That statistic tells average number of duplicates for a
key or a leading sequence of columns in a key.  The key must
have an index associated with it.

There have been some discussion on the mail list on how to better
automatically update this stat.

Michael J. Segel wrote:
> On Wednesday 19 October 2005 04:04, Øystein Grøvlen wrote:
> 
>>>>>>>"MM" == Mike Matrigali <mikem_app@sbcglobal.net> writes:
>>
>>    MM> I 2nd Satheesh's query, it is useful to know why you care.  Derby
>>    MM> unlike most other database's automatically maintains histogram
>>    MM> type information about the tables (this does require indexes to
>>    MM> exist).  This information is gathered directly from the indexes,
>>    MM> and is automatically maintained by Derby.
>>
>>    MM> The row count is also automatically maintained by Derby - whether
>>    MM> an index exists or not.
>>
>>Sounds good, but I think read somewhere in the Derby documentation
>>that some statistical information was only updated when a query did a
>>sequential scan of the table.
> 
> Well that doesn't make sense in light of what Mike wrote in a different 
> e-mail.
> 
> He indicated that Derby used information gathered by an index that the 
> DBA/user created on a table.
> 
> If I understood correctly, suppose I created the following
> 
> Table foo that had two columns: id an integer , and bar a char(25) field. I 
> then create an identity index on foo.id.
> 
> Derby would utilize the identity index to automatically update the table 
> statistics.
> 
> If you think about it, it kind of makes sense to do this. I mean after all, on 
> an identity index (non null, all distinct and unique), the number of nodes 
> would represent the number of rows in the table.
> 
> Of course it would also imply some overhead on performing the inserts. No?
> (There is no such thing as a free lunch.  Any sort of hands free automation 
> will come at a cost.)
> 
> I haven't thought a lot about it, but shooting from the hip, I would imagine 
> that the cost of this "optimization" would increase as the tables increased 
> in size by a log(N) factor. So Derby takes a hit when the database gets 
> large.
> 
> This of course is assuming I understood Mike's comment.
> 
> -G
> 


Mime
View raw message