db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mamta A. Satoor (JIRA)" <j...@apache.org>
Subject [jira] Updated: (DERBY-3788) Provide a zero-admin way of updating the statisitcs of an index
Date Wed, 05 Nov 2008 23:10:44 GMT

     [ https://issues.apache.org/jira/browse/DERBY-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Mamta A. Satoor updated DERBY-3788:
-----------------------------------

    Attachment: DERBY3788_patch1_stat.txt
                DERBY3788_patch1_diff.txt

First of all, let me start out by saying that this patch may not be ready for commit yet depending
on what the community thinks about the approach taken to establish a Connection in a background
thread to fire update statisitcs jobs. 

Background information
There can be situations in Derby where the statistics might not exist for some of the indexes.
The absence of statistics will hamper optimizer for picking up the best available path for
query execution. We have added a manual way of updating the statistics through DERBY-269.
But this is a manual process rather than zero-admin way of mainting statistics. I am attempting
to have Derby fire the update statistics job in the background when it finds that statistics
are not available for an index. Derby can detect the unavailability of statistics during query
optimization phase where it queries SYSSTATISTICS table to see if statistics are available
for an index. With my patch, if the statistics are missing, then we will schedule a background
task to have the statistics available for next time around when optimizer needs to look at
them. To schedule these tasks, rather than inventing our own way of managaing threads, I have
relied on new classes available in jdk1.5 This means that the zero-admin way of maintaining
statistics will be available in jdk 1.5 and higher versions. In order to make the update statistics
funtionality available in jdk1.5 and higher, I had to subclass a new version of DataDictionaryImpl
called DataDictionaryImpl5 and that class will be used only for jdk1.5 and higher. The background
task creates an EmbedConnection of it's own and executes following sql on the JDBC Statement
obtained from the EmbedConnection.
CALL SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(schemaName, tableName, indexName);

This is where I would appreciate if the Derby community can see if my approach is the best
way to execute update statistics ie by creating an EmbedConnection. What's even trickier is
the way I get the correct instance of InternalDriver, the url and properties that were used
to make the original JDBC connection which required us to fire this update statistics task.
The code may not look very clean in how get these objects which are required to make an EmbedConnection.
I tried looking through existing Derby code to see if we are required to execute our own sql
in an independent thread but I didn't find any luck with that. So, please let me know if you
agree with the approach or have ideas on how to achieve this in a different way, I will really
appreciate it. The work of getting the required objects to create an EmbedConnection happens
in DataDictionaryImpl5.

Please let me know if I can provide more information to make the patch more clear. I haven't
put much comments in the new code in the patch because I wanted to sure that we as a community
agree on the approach taken to do the actual work of update statistics by creating EmbedConnection
object.

I did try running the existing junit tests with my patch and found handful of tests to fail.
I analyzed few of those failures and I think they are locking failures. I have a feeling that
when we try to update the statistics, we hold the locks which conflict with what the test
might be doing at the same time that statistics are getting updated in a different thread.
I wonder what should we do about these failures.


> Provide a zero-admin way of updating the statisitcs of an index
> ---------------------------------------------------------------
>
>                 Key: DERBY-3788
>                 URL: https://issues.apache.org/jira/browse/DERBY-3788
>             Project: Derby
>          Issue Type: New Feature
>          Components: Performance
>    Affects Versions: 10.5.0.0
>            Reporter: Mamta A. Satoor
>            Assignee: Mamta A. Satoor
>         Attachments: DERBY3788_patch1_diff.txt, DERBY3788_patch1_stat.txt, DERBY_3788_Mgr.java,
DERBY_3788_Repro.java
>
>
> DERBY-269 provided a manual way of updating the statistics using the new system stored
procedure SYSCS_UTIL.SYSCS_UPDATE_STATISTICS. It will be good for Derby to provide an automatic
way of updating the statistics without requiring to run the stored procedure manually. There
was some discussion on DERBY-269 about providing the 0-admin way. I have copied it here for
reference.
> *********************
> Kathey Marsden - 22/May/05 03:53 PM 
> Some sort of zero admin solution for updating statistics would be prefferable to the
manual 'update statistics' 
> *********************
> *********************
> Mike Matrigali - 11/Jun/08 12:37 PM 
> I have not seen any other suggestions, how about the following zero admin solution? It
is not perfect - suggestions welcome. 
> Along with the statistics storing, save how many rows were in the table when exact statistics
were calculated. This number is 0 if none have been calculated because index creation happened
on an empty table. At query compile time when we look up statistics we automatically recalculate
the statistics at certain threshholds - say something like row count growing past next threshhold
: 10, 100, 1000, 100000 - with upper limit being somewhere around how many rows we can process
in some small amount of time - like 1 second on a modern laptop. If we are worried about response
time, maybe we background queue the stat gathering rather than waiting with maybe some quick
load if no stat has ever been gathered. The background gathering could be optimized to not
interfere with locks by using read uncommitted. 
> I think it would be useful to also have the manual call just to make it easy to support
customers and debug issues in the field. There is proably always some dynamic data distribution
change that in some case won't be picked up by the automatic algorithm. Also just very useful
for those who have complete control of the create ddl, load data, run stats, deliver application
process. 
> *********************

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message