db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mamta A. Satoor (JIRA)" <j...@apache.org>
Subject [jira] Commented: (DERBY-3788) Provide a zero-admin way of updating the statisitcs of an index
Date Mon, 26 Jul 2010 16:18:50 GMT

    [ https://issues.apache.org/jira/browse/DERBY-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892343#action_12892343

Mamta A. Satoor commented on DERBY-3788:

Just wanted to summarize what has been tried for this jira so far and a proposal for a possible
solution worth trying
1)First I tried to see if the missing stats could be created inline when the SELECT query
compilation finds that missing but that failed because SELECT query compilation marked the
data dictionary in read only mode where as update statistics needs to update the data dictionary(this
marking of read only for data dictionary happens in GenericStatement.prepMinion). Because
of this read-only marking, we can't create the statistics inline of the compilation phase
of the SELECT query. 
2)Second approach was during the SELECT compile phase, if we come across missing stats, then
schedule a task for later to update the stats. Finish the SELECT compilation and execution
after the update stat task is scheduled on the queue (the scheduling mechanism used is only
available with jdk 1.5 and higher). When the scheduled task was later fired, it would run
no current connection exception. The reason for that was that there was no context set for
update statistics to run. Tried resolving this by creating a connection context through following
call InternalDriver.activeDriver().connect(url, info) I had the SELECT compilation pass the
needed information to the background task so I can create the connection. This is pretty crude
way of getting a connection to update the statistics. Along with the fact that the code to
create the connection context was not very modular, some of the existing tests failed with
this changes because of locking issues. This probably is because the locks held by the background
thread for update stats interfered with locks required by the rest of the test. 
3)The third approach I tried was when the SELECT query detects missing stats, abort the query
compilation, (thus removing the data dictionary from read-only mode), create the stats and
then restart the original SELECT query compilation and execution. The hope with this approach
was that it will avoid running into locking issues. If we decide to pursue this approach more,
we need to make sure that we look at the possibility of stats still missing when the original
query is compile the 2nd time. This can happen say because the user does not have privileges
to update the stats, the update stat ran into locking issues. If we won't account for possibility
of missing stats during the 2nd time through the original query, we can end up in an infinite
loop. Another problem with this approach was that update stats were happening in the same
transaction as the original SELECT query. Which means that the locks acquired by the update
stat will not be released until the transaction is committed. Came across locking issues with
this when I ran the existing junit and derbyall tests. One way to fix this could be to start
a nested user transaction and do the update stat in that transaction and commit that transaction(I
think something like this is done for identity columns. If yes, then we might find pointers
there to use for update stats in nested user transaction). Thus any locks acquired by the
update stat work will be released. And after that, go back to the original query again for
compilation and execution. 
4)With any approach we take, we should detect the case where the table is empty and hence
the stats might be missing. For such a case, we should not try to fire stats creation task.
to rephrase, during the compile phase of the SELECT query, when we look for statistics, first
check if the table is empty and if yes then skip the code for collecting stats. From what
I recall, one can check if the table is empty by issuing open scan controller and asking it
for the number of rows in the table. If we do not check for empty table, then we will end
up in infinite loop. 
5)For read-only db we should detect that we are readonly and don't try to do update statistics

6)Also, we probably want to create statistics only for user tables? Haven't thought enough
about it. 
7)What if the user executing the query is not the owner of the table. should statistics still
be created? I think if the user does not have enough privileges, then skip the step of creating
stats since it's going to fail anyways. 
8)Once statistics are created, notify tabledescriptor about it because tabledescriptor  cahces
available statistics and it will never know of the new statistics since tabledescriptor holds
on to that cache. 
9)Once we have update stats working, another improvement could be to identify the cached compiled
queries that may benefit from updated stats and have those queries recompile when they are
executed next time. In other words, the relevant queries should be invalidated as part of
the statistics collection task. 
10)Look at other scenarios(in addition to step 7) above) where we know that update stats might
fail. For those cases, don't spend time trying to create the stats. Instead just let the optimizer
work with the information it has at the moment to come up with the query plan. 

> Provide a zero-admin way of updating the statisitcs of an index
> ---------------------------------------------------------------
>                 Key: DERBY-3788
>                 URL: https://issues.apache.org/jira/browse/DERBY-3788
>             Project: Derby
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions:
>            Reporter: Mamta A. Satoor
>            Assignee: Mamta A. Satoor
>         Attachments: DERBY3788_patch1_diff.txt, DERBY3788_patch1_stat.txt, DERBY3788_patch2_diff.txt,
DERBY3788_patch2_stat.txt, DERBY3788_patch3_diff.txt, DERBY3788_patch3_stat.txt, DERBY3788_update_all_missing_stats_after_bind_patch4_diff.txt,
DERBY3788_update_all_missing_stats_after_bind_patch4_stat.txt, DERBY_3788_Mgr.java, DERBY_3788_Repro.java
> DERBY-269 provided a manual way of updating the statistics using the new system stored
procedure SYSCS_UTIL.SYSCS_UPDATE_STATISTICS. It will be good for Derby to provide an automatic
way of updating the statistics without requiring to run the stored procedure manually. There
was some discussion on DERBY-269 about providing the 0-admin way. I have copied it here for
> *********************
> Kathey Marsden - 22/May/05 03:53 PM 
> Some sort of zero admin solution for updating statistics would be prefferable to the
manual 'update statistics' 
> *********************
> *********************
> Mike Matrigali - 11/Jun/08 12:37 PM 
> I have not seen any other suggestions, how about the following zero admin solution? It
is not perfect - suggestions welcome. 
> Along with the statistics storing, save how many rows were in the table when exact statistics
were calculated. This number is 0 if none have been calculated because index creation happened
on an empty table. At query compile time when we look up statistics we automatically recalculate
the statistics at certain threshholds - say something like row count growing past next threshhold
: 10, 100, 1000, 100000 - with upper limit being somewhere around how many rows we can process
in some small amount of time - like 1 second on a modern laptop. If we are worried about response
time, maybe we background queue the stat gathering rather than waiting with maybe some quick
load if no stat has ever been gathered. The background gathering could be optimized to not
interfere with locks by using read uncommitted. 
> I think it would be useful to also have the manual call just to make it easy to support
customers and debug issues in the field. There is proably always some dynamic data distribution
change that in some case won't be picked up by the automatic algorithm. Also just very useful
for those who have complete control of the create ddl, load data, run stats, deliver application
> *********************

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message